Background

The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis of historical weather data from the METAR archives hosted by Iowa State University.

Data have been downloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.

This module will perform initial modeling on the processed weather files. It builds on the previous ‘WeatherModeling_202006_v001.Rmd’ and ‘WeatherModeling_202006_v002.Rmd’ as well as leveraging functions in ‘WeatherModelingFunctions_v001.R’.

This file focuses on:

  1. Random forest classification of select locales using 2014-2019 data
  2. Random forest classification of locales in 2016
  3. Random forest regression of temperatures in select locales for 2014-2019
  4. XGB regression of temperatures in select locales for 2014-2019
  5. XGB classification of locales in 2016

There are numerous other models available in ‘WeatherModeling_202006_v002.Rmd’.

Data Availability

There are three main processed files available for further exploration:

metar_postEDA_20200617.rds and metar_postEDA_extra_20200627.rds

  • source (chr) - the reporting station and year (the source file name)
  • locale (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • origMETAR (chr) - the original METAR associated with the observation at that source and date-time
  • year (dbl) - the year, extracted from dtime
  • monthint (dbl) - the month, extracted from dtime, as an integer
  • month (fct) - the month, extracted from dtime, as a three-character abbreviation (factor)
  • day (int) - the day of the month, extracted from dtime
  • WindDir (chr) - prevailing wind direction in degrees, stored as a character since ‘VRB’ means variable
  • WindSpeed (int) - the prevailing wind speed in knots
  • WindGust (dbl) - the wind gust speed in knots (NA if there is no recorded wind gust at that hour)
  • predomDir (chr) - the predominant wind direction as NE-E-SE-S-SW-W-NW-N-VRB-000-Error
  • Visibility (dbl) - surface visibility in statute miles
  • Altimeter (dbl) - the altimeter setting in inches of mercury
  • TempF (dbl) - the Fahrenheit temperature
  • DewF (dbl) - the Fahrenheit dew point
  • modSLP (dbl) - Sea-Level Pressure (SLP) in hPa, adjusted to account for SLP being recorded as 0-1000 while the underlying values span roughly 950-1050
  • cTypen (chr) - the cloud type of the nth cloud layer (FEW, BKN, SCT, OVC, or VV)
  • cLeveln (dbl) - the cloud height in feet of the nth cloud layer
  • isRain (lgl) - was rain occurring at the moment the METAR was captured?
  • isSnow (lgl) - was snow occurring at the moment the METAR was captured?
  • isThunder (lgl) - was thunder occurring at the moment the METAR was captured?
  • p1Inches (dbl) - how many inches of rain occurred in the past hour?
  • p36Inches (dbl) - how many inches of rain occurred in the past 3/6 hours (3-hour summaries at 3Z-9Z-15Z-21Z and 6-hour summaries at 6Z-12Z-18Z-24Z and NA at any other Z times)?
  • p24Inches (dbl) - how many inches of rain occurred in the past 24 hours (at 12Z, NA at all other times)
  • tempFHi (dbl) - the high temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • tempFLo (dbl) - the low temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • minHeight (dbl) - the minimum cloud height in feet (-100 means ‘no clouds’)
  • minType (fct) - amount of obscuration at the minimum cloud height (VV > OVC > BKN > SCT > FEW > CLR)
  • ceilingHeight (dbl) - the minimum cloud ceiling in feet (-100 means ‘no ceiling’)
  • ceilingType (fct) - amount of obscuration at the minimum ceiling height (VV > OVC > BKN)
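The modSLP adjustment follows standard METAR decoding: the SLP group carries only the last three digits of the pressure (tens, units, and tenths of hPa), so the decoded value is prefixed with 9 or 10 depending on magnitude. A minimal sketch (the helper `decodeSLP` is hypothetical and for illustration only; the actual logic lives in the processing pipeline):

```r
# Decode a three-digit METAR SLP group (e.g., 134 -> 1013.4 hPa).
# Values of 50.0 or more after dividing by 10 are taken as 9xx.x hPa;
# smaller values are taken as 10xx.x hPa (matching the 950-1050 range above).
decodeSLP <- function(code3) {
    v <- code3 / 10
    ifelse(v >= 50, 900 + v, 1000 + v)
}

decodeSLP(134)  # 1013.4
decodeSLP(982)  # 998.2
```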

metar_modifiedClouds_20200617.rds and metar_modifiedclouds_extra_20200627.rds

  • source (chr) - the reporting station and year (the source file name)
  • sourceName (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • level (dbl) - cloud level (level 0 is inserted for every source-dtime as a base layer of clear)
  • height (dbl) - level height (height -100 is inserted for every source-dtime as a base layer of clear)
  • type (chr) - level type (type CLR is inserted for every source-dtime as a base layer of clear)

metar_precipLists_20200617.rds and metar_precipLists_extra_20200627.rds

  • Contains elements for each of rain/snow/thunder for each of 2015/2016/2017
  • Each element contains a list and a tibble
  • The tibble is precipLength and contains precipitation by month as source-locale-month-hours-events
  • The list is precipList and contains data on each precipitation interval

Several mapping files are defined for use in plotting; tidyverse, lubridate, and caret are loaded; and the relevant functions are sourced:

# The process frequently uses tidyverse, lubridate, caret, and randomForest
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"


# Sourcing functions
source("./WeatherModelingFunctions_v001.R")


# Descriptive names for key variables
varMapper <- c(source="Original source file name", 
               locale="Descriptive name",
               dtime="Date-Time (UTC)",
               origMETAR="Original METAR",
               year="Year",
               monthint="Month",
               month="Month", 
               day="Day of Month",
               WindDir="Wind Direction (degrees)", 
               WindSpeed="Wind Speed (kts)",
               WindGust="Wind Gust (kts)",
               predomDir="General Prevailing Wind Direction",
               Visibility="Visibility (SM)", 
               Altimeter="Altimeter (inches Hg)",
               TempF="Temperature (F)",
               DewF="Dew Point (F)", 
               modSLP="Sea-Level Pressure (hPa)", 
               cType1="First Cloud Layer Type", 
               cLevel1="First Cloud Layer Height (ft)",
               isRain="Rain at Observation Time",
               isSnow="Snow at Observation Time",
               isThunder="Thunder at Observation Time",
               tempFHi="24-hour High Temperature (F)",
               tempFLo="24-hour Low Temperature (F)",
               minHeight="Minimum Cloud Height (ft)",
               minType="Obscuration Level at Minimum Cloud Height",
               ceilingHeight="Minimum Ceiling Height (ft)",
               ceilingType="Obscuration Level at Minimum Ceiling Height", 
               hr="Hour of Day (Zulu time)",
               hrfct="Hour of Day (Zulu time)",
               hrBucket="Hour of Day (Zulu time) - rounded to nearest 3",
               locNamefct="Locale Name"
               )


# File name to city name mapper
cityNameMapper <- c(katl_2016="Atlanta, GA (2016)",
                    kbos_2016="Boston, MA (2016)", 
                    kdca_2016="Washington, DC (2016)", 
                    kden_2016="Denver, CO (2016)", 
                    kdfw_2016="Dallas, TX (2016)", 
                    kdtw_2016="Detroit, MI (2016)", 
                    kewr_2016="Newark, NJ (2016)",
                    kgrb_2016="Green Bay, WI (2016)",
                    kgrr_2016="Grand Rapids, MI (2016)",
                    kiah_2016="Houston, TX (2016)",
                    kind_2016="Indianapolis, IN (2016)",
                    klas_2014="Las Vegas, NV (2014)",
                    klas_2015="Las Vegas, NV (2015)",
                    klas_2016="Las Vegas, NV (2016)", 
                    klas_2017="Las Vegas, NV (2017)", 
                    klas_2018="Las Vegas, NV (2018)",
                    klas_2019="Las Vegas, NV (2019)",
                    klax_2016="Los Angeles, CA (2016)", 
                    klnk_2016="Lincoln, NE (2016)",
                    kmia_2016="Miami, FL (2016)", 
                    kmke_2016="Milwaukee, WI (2016)",
                    kmsn_2016="Madison, WI (2016)",
                    kmsp_2016="Minneapolis, MN (2016)",
                    kmsy_2014="New Orleans, LA (2014)",
                    kmsy_2015="New Orleans, LA (2015)",
                    kmsy_2016="New Orleans, LA (2016)", 
                    kmsy_2017="New Orleans, LA (2017)", 
                    kmsy_2018="New Orleans, LA (2018)",
                    kmsy_2019="New Orleans, LA (2019)",
                    kord_2014="Chicago, IL (2014)",
                    kord_2015="Chicago, IL (2015)",
                    kord_2016="Chicago, IL (2016)", 
                    kord_2017="Chicago, IL (2017)", 
                    kord_2018="Chicago, IL (2018)",
                    kord_2019="Chicago, IL (2019)",
                    kphl_2016="Philadelphia, PA (2016)", 
                    kphx_2016="Phoenix, AZ (2016)", 
                    ksan_2014="San Diego, CA (2014)",
                    ksan_2015="San Diego, CA (2015)",
                    ksan_2016="San Diego, CA (2016)",
                    ksan_2017="San Diego, CA (2017)",
                    ksan_2018="San Diego, CA (2018)",
                    ksan_2019="San Diego, CA (2019)",
                    ksat_2016="San Antonio, TX (2016)", 
                    ksea_2016="Seattle, WA (2016)", 
                    ksfo_2016="San Francisco, CA (2016)", 
                    ksjc_2016="San Jose, CA (2016)",
                    kstl_2016="Saint Louis, MO (2016)", 
                    ktpa_2016="Tampa Bay, FL (2016)", 
                    ktvc_2016="Traverse City, MI (2016)"
                    )

# File names in 2016, based on cityNameMapper
names_2016 <- grep(names(cityNameMapper), pattern="[a-z]{3}_2016", value=TRUE)
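These named vectors act as simple lookup tables: indexing by file name returns the display label, and `grep` on the names selects the 2016 subset. A minimal usage sketch on a small subset of the mapper:

```r
# A small subset of cityNameMapper, used as a named-vector lookup table
cityNames <- c(kord_2016="Chicago, IL (2016)",
               klas_2014="Las Vegas, NV (2014)",
               kmia_2016="Miami, FL (2016)")

# Lookup by file name
unname(cityNames["kord_2016"])   # "Chicago, IL (2016)"

# Select only the 2016 file names, as done for names_2016 above
grep(names(cityNames), pattern="[a-z]{3}_2016", value=TRUE)  # "kord_2016" "kmia_2016"
```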

The main data come from the metar_postEDA files. They are integrated below; cloud and ceiling heights are converted to factors, hour is added as both a numeric and a factor variable, and locale is added as a factor variable:

# Main weather data
metarData <- readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_20200617.rds") %>%
    bind_rows(readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_extra_20200627.rds")) %>%
    mutate(orig_minHeight=minHeight, 
           orig_ceilingHeight=ceilingHeight, 
           minHeight=mapCloudHeight(minHeight), 
           ceilingHeight=mapCloudHeight(ceilingHeight), 
           hr=lubridate::hour(lubridate::round_date(dtime, unit="1 hour")),
           hrfct=factor(hr), 
           locNamefct=factor(str_replace(locale, pattern=" \\(\\d{4}\\)", replacement=""))
           )
glimpse(metarData)
## Observations: 437,417
## Variables: 46
## $ source             <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_201...
## $ locale             <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Det...
## $ dtime              <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-...
## $ origMETAR          <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 R...
## $ year               <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ monthint           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ month              <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ WindDir            <chr> "230", "230", "230", "240", "230", "220", "220",...
## $ WindSpeed          <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, ...
## $ WindGust           <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, ...
## $ predomDir          <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, ...
## $ Visibility         <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, ...
## $ Altimeter          <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14,...
## $ TempF              <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92,...
## $ DewF               <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94,...
## $ modSLP             <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, ...
## $ cType1             <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC",...
## $ cType2             <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC"...
## $ cType3             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType4             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType5             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType6             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cLevel1            <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ cLevel2            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, ...
## $ cLevel3            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel4            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel5            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel6            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ isRain             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isSnow             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isThunder          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ p1Inches           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, N...
## $ p36Inches          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA...
## $ p24Inches          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFHi            <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFLo            <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, ...
## $ minHeight          <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ minType            <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ ceilingHeight      <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ ceilingType        <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ orig_minHeight     <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ orig_ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ hr                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ hrfct              <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ locNamefct         <fct> "Detroit, MI", "Detroit, MI", "Detroit, MI", "De...
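The mapCloudHeight() function is sourced from ‘WeatherModelingFunctions_v001.R’ and is not shown here. One plausible implementation, assuming aviation-style height bands and the -100 ‘no clouds’ sentinel described above, would bucket heights with cut(). The cutoffs below are illustrative assumptions, not the actual values used:

```r
# Hypothetical sketch of mapCloudHeight(): bucket cloud heights (ft) into a factor.
# The band cutoffs (6,500 and 20,000 ft) are assumptions for illustration.
mapCloudHeightSketch <- function(height) {
    cut(height,
        breaks=c(-Inf, 0, 6500, 20000, Inf),
        labels=c("None", "Low", "Mid", "High")
        )
}

mapCloudHeightSketch(c(-100, 2000, 12000, 30000))
# [1] None Low  Mid  High
```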

Random Forest Classification (Select Locales for 2014-2019)

Models are run on all 2014-2019 data for Chicago, Las Vegas, New Orleans, and San Diego:

# Create the subset for Chicago, Las Vegas, New Orleans, San Diego
sub_2014_2019_data <- metarData %>%
    filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"), 
           year %in% c(2014, 2015, 2016, 2017, 2018, 2019)
           ) %>%
    mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""), 
           hr=lubridate::hour(dtime)
           )

# Check that proper locales are included
sub_2014_2019_data %>% 
    count(city, locale)
## # A tibble: 24 x 3
##    city          locale                   n
##    <chr>         <chr>                <int>
##  1 Chicago, IL   Chicago, IL (2014)    8718
##  2 Chicago, IL   Chicago, IL (2015)    8728
##  3 Chicago, IL   Chicago, IL (2016)    8767
##  4 Chicago, IL   Chicago, IL (2017)    8740
##  5 Chicago, IL   Chicago, IL (2018)    8737
##  6 Chicago, IL   Chicago, IL (2019)    8750
##  7 Las Vegas, NV Las Vegas, NV (2014)  8739
##  8 Las Vegas, NV Las Vegas, NV (2015)  8727
##  9 Las Vegas, NV Las Vegas, NV (2016)  8770
## 10 Las Vegas, NV Las Vegas, NV (2017)  8664
## # ... with 14 more rows

The random forest model is run and cached:

# Run random forest for 2014-2019 data
rf_types_2014_2019_TDmcwha <- rfMultiLocale(sub_2014_2019_data, 
                                            vrbls=c("TempF", "DewF", 
                                                    "month", "hr",
                                                    "minHeight", "ceilingHeight", 
                                                    "WindSpeed", "predomDir", 
                                                    "modSLP"
                                                    ),
                                            locs=NULL, 
                                            locVar="city",
                                            pred="city",
                                            ntree=50, 
                                            seed=2006301420, 
                                            mtry=4
                                            )
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2014_2019_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="city"
                )

## # A tibble: 16 x 5
##    locale          predicted       correct     n     pct
##    <fct>           <fct>           <lgl>   <int>   <dbl>
##  1 Chicago, IL     Chicago, IL     TRUE    14700 0.939  
##  2 Chicago, IL     Las Vegas, NV   FALSE     243 0.0155 
##  3 Chicago, IL     New Orleans, LA FALSE     382 0.0244 
##  4 Chicago, IL     San Diego, CA   FALSE     329 0.0210 
##  5 Las Vegas, NV   Chicago, IL     FALSE     237 0.0153 
##  6 Las Vegas, NV   Las Vegas, NV   TRUE    14878 0.958  
##  7 Las Vegas, NV   New Orleans, LA FALSE     130 0.00837
##  8 Las Vegas, NV   San Diego, CA   FALSE     288 0.0185 
##  9 New Orleans, LA Chicago, IL     FALSE     383 0.0242 
## 10 New Orleans, LA Las Vegas, NV   FALSE     137 0.00865
## 11 New Orleans, LA New Orleans, LA TRUE    14746 0.931  
## 12 New Orleans, LA San Diego, CA   FALSE     567 0.0358 
## 13 San Diego, CA   Chicago, IL     FALSE     248 0.0157 
## 14 San Diego, CA   Las Vegas, NV   FALSE     454 0.0288 
## 15 San Diego, CA   New Orleans, LA FALSE     334 0.0212 
## 16 San Diego, CA   San Diego, CA   TRUE    14739 0.934

Even with a small forest (50 trees), the model almost always separates Las Vegas, Chicago, San Diego, and New Orleans correctly. Although the climates of these cities are very different, it is striking that the model makes so few misclassifications.

How do other cities map against these classifications?

# Predictions on 2014-2019 data
helperPredictPlot(rf_types_2014_2019_TDmcwha$rfModel, 
                  df=filter(mutate(metarData, hr=lubridate::hour(dtime)), 
                            !(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))
                            ), 
                  predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
                  )

## # A tibble: 104 x 4
##    locale             predicted           n    pct
##    <chr>              <fct>           <int>  <dbl>
##  1 Atlanta, GA (2016) Chicago, IL      3019 0.345 
##  2 Atlanta, GA (2016) Las Vegas, NV     791 0.0904
##  3 Atlanta, GA (2016) New Orleans, LA  4113 0.470 
##  4 Atlanta, GA (2016) San Diego, CA     828 0.0946
##  5 Boston, MA (2016)  Chicago, IL      7620 0.881 
##  6 Boston, MA (2016)  Las Vegas, NV     457 0.0528
##  7 Boston, MA (2016)  New Orleans, LA   339 0.0392
##  8 Boston, MA (2016)  San Diego, CA     238 0.0275
##  9 Dallas, TX (2016)  Chicago, IL      2708 0.310 
## 10 Dallas, TX (2016)  Las Vegas, NV    1140 0.130 
## # ... with 94 more rows

Classifications are broadly as expected based on climates by locale. Variable importances are plotted:

helperPlotVarImp(rf_types_2014_2019_TDmcwha$rfModel)

Dew point and temperature are strong factors for separating the four cities in this analysis. Month, SLP, minimum cloud height, and prevailing wind direction are also meaningful.

An assessment can be run for the 2014-2019 model:

# Run for the full model including SLP
probs_2014_2019_TDmcwha <- 
    assessPredictionCertainty(rf_types_2014_2019_TDmcwha, 
                              keyVar="city", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP", 
                              showAcc=TRUE
                              )
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(keyVar)` instead of `keyVar` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(pkVars)` instead of `pkVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

  • Predictions with 80%+ of the votes are made ~75% of the time, and these predictions are ~99% accurate
  • Predictions with <80% of the votes are made ~25% of the time, and these predictions are ~80% accurate
  • The percentage of votes received appears to be a reasonable proxy for the confidence of the prediction
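The vote-share-as-confidence idea can be sketched without the full pipeline: given a matrix of per-class vote fractions and the true labels, bucket each prediction by its winning vote share and compute accuracy per bucket. The data below are hypothetical, for illustration only:

```r
# Hypothetical per-observation vote fractions (rows sum to 1) and true labels
votes <- rbind(c(0.90, 0.06, 0.04),
               c(0.55, 0.30, 0.15),
               c(0.45, 0.40, 0.15),
               c(0.85, 0.10, 0.05))
colnames(votes) <- c("Chicago, IL", "Las Vegas, NV", "San Diego, CA")
truth <- c("Chicago, IL", "Las Vegas, NV", "Chicago, IL", "Chicago, IL")

# Winning class and its vote share for each observation
pred     <- colnames(votes)[max.col(votes)]
topShare <- apply(votes, 1, max)

# Accuracy within low- vs. high-confidence buckets (80% vote threshold)
tapply(pred == truth, topShare >= 0.8, mean)
# FALSE  TRUE 
#   0.5   1.0
```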

A similar process can be run for assessing the classification of the other cities against the 2014-2019 data for Chicago, Las Vegas, New Orleans, and San Diego:

useData <- metarData %>%
    filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))) %>%
    mutate(hr=lubridate::hour(dtime))
    
# Run for the model excluding SLP
probs_allcities_2014_2019_TDmcwh <- 
    assessPredictionCertainty(rf_types_2014_2019_TDmcwha, 
                              testData=useData,
                              keyVar="locale", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, modSLP", 
                              showHists=TRUE
                              )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The model is often less confident when assigning an archetype to related cities, though it usually makes the most sensible assignment.

Random Forest Classification (2016)

Next, every pairing of two cities is compared, using all variables, an mtry of 4, and a very small forest of 15 trees:

# Create a container list to hold the output
list_varimp_2016 <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))

# Set a random seed
set.seed(2007031342)

# Loop through all possible combinations
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
    for (ctr2 in (ctr+1):length(names_2016)) {
        list_varimp_2016[[n]] <- rfTwoLocales(mutate(metarData, hr=lubridate::hour(dtime)), 
                                              loc1=names_2016[ctr], 
                                              loc2=names_2016[ctr2], 
                                              vrbls=c("TempF", "DewF", 
                                                      "month", "hr",
                                                      "minHeight", "ceilingHeight", 
                                                      "WindSpeed", "predomDir", 
                                                      "modSLP", "Altimeter"
                                                      ),
                                              ntree=15, 
                                              mtry=4
                                              )
        n <- n + 1
        if ((n %% 40) == 0) { cat("Through number:", n, "\n")}
    }
}
## Through number: 40 
## Through number: 80 
## Through number: 120 
## Through number: 160 
## Through number: 200 
## Through number: 240 
## Through number: 280 
## Through number: 320 
## Through number: 360 
## Through number: 400
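The nested loop above enumerates all choose(n, 2) unordered pairs of locales. The same enumeration can be written more compactly with combn(), shown here on a small demonstration vector (the rfTwoLocales call is omitted):

```r
# Enumerate all unordered pairs of names, as the nested loop above does
names_demo <- c("katl_2016", "kbos_2016", "kdca_2016", "kden_2016")
pairs <- combn(names_demo, 2, simplify=FALSE)

length(pairs)   # choose(4, 2) = 6
pairs[[1]]      # "katl_2016" "kbos_2016"
```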
# Create a tibble from the underlying accuracy data
acc_varimp_2016 <- map_dfr(list_varimp_2016, .f=helperAccuracyLocale)

# Assess the top 20 classification accuracies
acc_varimp_2016 %>%
    arrange(-accOverall) %>%
    head(20)
## # A tibble: 20 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Denver, CO (2016)      Miami, FL (2016)           0.998      0.998      0.998
##  2 Denver, CO (2016)      Tampa Bay, FL (2016)       0.996      0.998      0.995
##  3 Las Vegas, NV (2016)   Miami, FL (2016)           0.995      0.995      0.995
##  4 Denver, CO (2016)      New Orleans, LA (201~      0.995      0.996      0.994
##  5 Denver, CO (2016)      San Diego, CA (2016)       0.994      0.994      0.994
##  6 Denver, CO (2016)      Houston, TX (2016)         0.993      0.994      0.992
##  7 Denver, CO (2016)      San Francisco, CA (2~      0.993      0.994      0.992
##  8 Miami, FL (2016)       Seattle, WA (2016)         0.992      0.990      0.994
##  9 Miami, FL (2016)       Phoenix, AZ (2016)         0.992      0.991      0.992
## 10 Miami, FL (2016)       Traverse City, MI (2~      0.991      0.993      0.990
## 11 Miami, FL (2016)       Minneapolis, MN (201~      0.991      0.991      0.990
## 12 Boston, MA (2016)      Miami, FL (2016)           0.991      0.990      0.991
## 13 Denver, CO (2016)      Los Angeles, CA (201~      0.991      0.993      0.989
## 14 Denver, CO (2016)      San Jose, CA (2016)        0.990      0.991      0.989
## 15 Denver, CO (2016)      San Antonio, TX (201~      0.990      0.992      0.988
## 16 Grand Rapids, MI (201~ Miami, FL (2016)           0.990      0.990      0.990
## 17 Green Bay, WI (2016)   Miami, FL (2016)           0.990      0.989      0.990
## 18 Miami, FL (2016)       Milwaukee, WI (2016)       0.989      0.989      0.989
## 19 Madison, WI (2016)     Miami, FL (2016)           0.989      0.986      0.992
## 20 Las Vegas, NV (2016)   Tampa Bay, FL (2016)       0.989      0.990      0.987
# Assess the bottom 20 classification accuracies
acc_varimp_2016 %>%
    arrange(accOverall) %>%
    head(20)
## # A tibble: 20 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Chicago, IL (2016)     Milwaukee, WI (2016)       0.675      0.685      0.666
##  2 Newark, NJ (2016)      Philadelphia, PA (20~      0.676      0.687      0.665
##  3 Detroit, MI (2016)     Grand Rapids, MI (20~      0.712      0.718      0.706
##  4 Madison, WI (2016)     Milwaukee, WI (2016)       0.718      0.725      0.711
##  5 Philadelphia, PA (201~ Washington, DC (2016)      0.730      0.741      0.718
##  6 Chicago, IL (2016)     Grand Rapids, MI (20~      0.731      0.750      0.711
##  7 Green Bay, WI (2016)   Madison, WI (2016)         0.737      0.747      0.728
##  8 Chicago, IL (2016)     Detroit, MI (2016)         0.740      0.746      0.735
##  9 Chicago, IL (2016)     Madison, WI (2016)         0.745      0.756      0.733
## 10 Grand Rapids, MI (201~ Milwaukee, WI (2016)       0.746      0.747      0.744
## 11 Green Bay, WI (2016)   Milwaukee, WI (2016)       0.754      0.757      0.751
## 12 Madison, WI (2016)     Minneapolis, MN (201~      0.758      0.759      0.757
## 13 Grand Rapids, MI (201~ Madison, WI (2016)         0.759      0.763      0.755
## 14 Detroit, MI (2016)     Milwaukee, WI (2016)       0.765      0.777      0.753
## 15 Grand Rapids, MI (201~ Traverse City, MI (2~      0.769      0.770      0.768
## 16 Detroit, MI (2016)     Indianapolis, IN (20~      0.772      0.775      0.770
## 17 Chicago, IL (2016)     Indianapolis, IN (20~      0.775      0.777      0.774
## 18 Boston, MA (2016)      Newark, NJ (2016)          0.778      0.781      0.776
## 19 Indianapolis, IN (201~ Saint Louis, MO (201~      0.779      0.791      0.768
## 20 Madison, WI (2016)     Traverse City, MI (2~      0.783      0.782      0.784

The best accuracies are obtained when comparing cities in very different climates (e.g., Denver vs. humid/marine locales, or Miami vs. desert/cold locales), while the worst accuracies come from comparing very similar cities (e.g., Chicago and Milwaukee, or Newark and Philadelphia).

Variable importance can then be assessed across all 1:1 classifications:

# Create a tibble of all the variable importance data
val_varimp_2016 <- map_dfr(list_varimp_2016, 
                           .f=function(x) { x$rfModel %>% 
                                   caret::varImp() %>% 
                                   t() %>% 
                                   as.data.frame()
                               }
                           ) %>% 
    tibble::as_tibble()
# Create boxplot of overall variable importance
val_varimp_2016 %>% 
    mutate(num=1:nrow(val_varimp_2016)) %>% 
    pivot_longer(-num, names_to="variable", values_to="varImp") %>% 
    ggplot(aes(x=fct_reorder(variable, varImp), y=varImp)) + 
    geom_boxplot(fill="lightblue") + 
    labs(x="", 
         y="Variable Importance", 
         title="Variable Importance for 1:1 Random Forest Classifications"
         )

# Attach the city names and OOB error rate
tbl_varimp_2016 <- sapply(list_varimp_2016, 
                          FUN=function(x) { c(names(x$errorRate[2:3]), x$errorRate["OOB"]) }
                          ) %>%
    t() %>% 
    as.data.frame() %>% 
    bind_cols(val_varimp_2016) %>% 
    tibble::as_tibble() %>% 
    mutate(OOB=as.numeric(as.character(OOB))) %>%
    rename(locale1=V1, 
           locale2=V2
           )

# Plot accuracy vs. spikiness of variable importance
tbl_varimp_2016 %>%
    pivot_longer(-c(locale1, locale2, OOB), names_to="var", values_to="varImp") %>% 
    group_by(locale1, locale2, OOB) %>% 
    summarize(mean=mean(varImp), max=max(varImp)) %>% 
    mutate(maxMean=max/mean) %>%
    ggplot(aes(x=maxMean, y=1-OOB)) + 
    geom_point() + 
    geom_smooth(method="loess") +
    labs(x="Ratio of Maximum Variable Importance to Mean Variable Importance", 
         y="OOB Accuracy", 
         title="Accuracy vs. Spikiness of Variable Importance"
         )

Broadly speaking, the same variables that drive overall classification are important in driving 1:1 classifications. There is meaningful spikiness, suggesting that different 1:1 classifications rely on different variables.

There is a strong trend in which the best accuracies are obtained when a single spiky dimension drives the classification. This suggests that while the model can take advantage of all 10 variables, it has the easiest time when there is a single, well-differentiated variable. No surprise.
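The ‘spikiness’ measure used here is simply the ratio of the maximum to the mean variable importance: a flat profile gives a ratio near 1, while a single dominant variable gives a large ratio. With toy numbers for illustration:

```r
# Flat importance profile: every variable contributes similarly
flat  <- c(10, 9, 11, 10, 10)
# Spiky profile: one variable dominates
spiky <- c(50, 5, 4, 6, 5)

max(flat)  / mean(flat)   # 1.1
max(spiky) / mean(spiky)  # ~3.57
```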

Random Forest Regression of Temperatures (Select Locales for 2014-2019)

Random forests can also be used to run regressions on continuous variables such as temperature or dew point. Models are run on the 2014-2019 data for the locales with full data available (Chicago, IL; Las Vegas, NV; New Orleans, LA; San Diego, CA):

# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

# Create a main list, one per locale
lstFullData <- vector("list", length(fullDataLocs))


# Create a list of predictor variables and a list of variables to keep
depVarFull <- c('hrfct', 'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 
                'predomDir', 'minHeight', 'ceilingHeight'
                )
keepVarFull <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct', 'DewF', 'modSLP', 
                 'Altimeter', 'WindSpeed', 'predomDir', 'minHeight', 'ceilingHeight'
                 )


# Run the regressions by locale and month
nLoc <- 1
for (loc in fullDataLocs) {
    
    # Pull data for only this locale, and where TempF is not missing
    pullData <- metarData %>%
        filter(locNamefct==loc, !is.na(TempF))
    
    # Create the months to be run
    fullDataMonths <- pullData %>%
        count(month) %>%
        pull(month)
    
    # Create containers for each run
    lstFullData[[nLoc]] <- vector("list", length(fullDataMonths))
    
    # Run random forest regression for each month for the locale
    cat("\nBeginning to process:", loc)
    nMonth <- 1
    for (mon in fullDataMonths) {
        
        # Run the regression
        lstFullData[[nLoc]][[nMonth]] <- rfRegression(pullData, 
                                                      depVar="TempF", 
                                                      predVars=depVarFull, 
                                                      otherVar=keepVarFull,
                                                      critFilter=list(locNamefct=loc, month=mon), 
                                                      seed=2007271252, 
                                                      ntree=100, 
                                                      mtry=4, 
                                                      testSize=0.3
                                                      )
        
        # Increment the counter
        nMonth <- nMonth + 1
        cat("\nFinished month:", mon)
    }
    
    # Increment the counter
    nLoc <- nLoc + 1
    
}
## 
## Beginning to process: Chicago, IL
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
## Beginning to process: Las Vegas, NV
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
## Beginning to process: New Orleans, LA
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
## Beginning to process: San Diego, CA
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec

The relevant ‘testData’ files can then be combined for an assessment of overall prediction accuracy:

# Helper function to extract testData from inner list
combineTestData <- function(lst, elem="testData") {
    map_dfr(lst, .f=function(x) x[[elem]])
}

# Combine all of the test data files
fullTestData <- map_dfr(lstFullData, .f=combineTestData) %>%
    mutate(err=predicted-TempF, 
           year=factor(year)
           )

# Helper function to create RMSE data
helperCreateRMSE <- function(df, byVar, depVar, errVar="err") {
    
    df %>%
        group_by_at(vars(all_of(byVar))) %>%
        summarize(varTot=var(get(depVar)), varModel=mean(get(errVar)**2)) %>%
        mutate(rmseTot=varTot**0.5, rmseModel=varModel**0.5, rsq=1-varModel/varTot)
    
}

# Create plot for a given by-variable and facet-variable
helperRMSEPlot <- function(df, byVar, depVar, facetVar=NULL) {

    # Create a copy of the original by variable
    byVarOrig <- byVar
    
    # Expand byVar to include facetVar if facetVar is not null
    if (!is.null(facetVar)) {
        byVar <- unique(c(byVar, facetVar))
    }
    
    # Create the RMSE plot
    p1 <- df %>%
        helperCreateRMSE(byVar=byVar, depVar=depVar) %>%
        select_at(vars(all_of(c(byVar, "rmseTot", "rmseModel")))) %>%
        pivot_longer(c(rmseTot, rmseModel), names_to="model", values_to="rmse") %>%
        group_by_at(vars(all_of(byVar))) %>%
        mutate(dRMSE=ifelse(row_number()==n(), rmse, rmse-lead(rmse)), 
               model=factor(model, levels=c("rmseTot", "rmseModel"))
               ) %>%
        ggplot(aes_string(x=byVarOrig, y="dRMSE", fill="model")) + 
        geom_col() + 
        geom_text(data=~filter(., model=="rmseModel"), aes(y=dRMSE/2, label=round(dRMSE, 1))) +
        coord_flip() + 
        labs(x="", y="RMSE", title="RMSE before and after modelling") + 
        scale_fill_discrete("", 
                            breaks=c("rmseModel", "rmseTot"), 
                            labels=c("Final", "Explained by Model")
                            ) + 
        theme(legend.position="bottom")
    # Add facetting if the argument was passed
    if (!is.null(facetVar)) { p1 <- p1 + facet_wrap(as.formula(paste("~", facetVar))) }
    print(p1)
    
}

# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData, byVar="locNamefct", depVar="TempF")

helperRMSEPlot(fullTestData, byVar="year", depVar="TempF")

helperRMSEPlot(fullTestData, byVar="month", depVar="TempF")

# Facetted by locale
helperRMSEPlot(fullTestData, byVar="year", depVar="TempF", facetVar="locNamefct")

helperRMSEPlot(fullTestData, byVar="month", depVar="TempF", facetVar="locNamefct")

Further, an overall decline in MSE can be estimated as the average of the MSE declines in each locale-month:

# Function to extract MSE data from inner lists
helperMSETibble <- function(x) { 
    map_dfr(x, .f=function(y) tibble::tibble(ntree=1:length(y$mse), mse=y$mse)) 
}

map_dfr(lstFullData, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
    group_by(source, ntree) %>%
    summarize(meanmse=mean(mse)) %>%
    ungroup() %>%
    mutate(source=fullDataLocs[as.integer(source)]) %>%
    ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) + 
    geom_line() + 
    ylim(c(0, NA)) + 
    labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")

At 100 trees, the model appears to have largely finished learning, with no further material declines in MSE. Overall, model predictions differ from actual temperatures by 3-4 degrees on average. Deviations are larger in Las Vegas (4-5 degrees) and in the spring in Chicago (4-5 degrees), and smaller in San Diego (2-3 degrees) and in the winter in Chicago (2-3 degrees).

The model is then run for all months combined for a single locale, to compare results when month is a trained explanatory variable rather than a segment modelled separately:

# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

# Create a main list, one per locale
lstFullData_002 <- vector("list", length(fullDataLocs))


# Create a list of relevant dependent variables and variables to keep
depVarFull_002 <- c('month', 'hrfct', 'DewF', 'modSLP', 
                    'Altimeter', 'WindSpeed', 'predomDir', 
                    'minHeight', 'ceilingHeight'
                    )
keepVarFull_002 <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct', 
                     'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 'predomDir', 
                     'minHeight', 'ceilingHeight'
                     )


# Run the regressions by locale (all months combined)
nLoc <- 1
for (loc in fullDataLocs) {
    
    # Pull data for only this locale, and where TempF is not missing
    pullData <- metarData %>%
        filter(locNamefct==loc, !is.na(TempF))
    
    # To be parallel with previous runs, make a length-one list inside locale
    lstFullData_002[[nLoc]] <- vector("list", 1)
    
    # Run random forest regression for each locale
    cat("\nBeginning to process:", loc)
    lstFullData_002[[nLoc]][[1]] <- rfRegression(pullData, 
                                                 depVar="TempF", 
                                                 predVars=depVarFull_002, 
                                                 otherVar=keepVarFull_002,
                                                 critFilter=list(locNamefct=loc), 
                                                 seed=2007281307, 
                                                 ntree=25, 
                                                 mtry=4, 
                                                 testSize=0.3
                                                 )
    
    # Increment the counter
    nLoc <- nLoc + 1
    
}
## 
## Beginning to process: Chicago, IL
## Beginning to process: Las Vegas, NV
## Beginning to process: New Orleans, LA
## Beginning to process: San Diego, CA

The results can then be compared to the results of the regressions run using month as a segment:

# Combine all of the test data files
fullTestData_002 <- map_dfr(lstFullData_002, .f=combineTestData) %>%
    mutate(err=predicted-TempF, 
           year=factor(year)
           )

# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData_002, byVar="locNamefct", depVar="TempF")

helperRMSEPlot(fullTestData_002, byVar="year", depVar="TempF")

helperRMSEPlot(fullTestData_002, byVar="month", depVar="TempF")

# Facetted by locale
helperRMSEPlot(fullTestData_002, byVar="year", depVar="TempF", facetVar="locNamefct")

helperRMSEPlot(fullTestData_002, byVar="month", depVar="TempF", facetVar="locNamefct")

# Evolution of MSE
map_dfr(lstFullData_002, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
    group_by(source, ntree) %>%
    summarize(meanmse=mean(mse)) %>%
    ungroup() %>%
    mutate(source=fullDataLocs[as.integer(source)]) %>%
    ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) + 
    geom_line() + 
    ylim(c(0, NA)) + 
    labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")

Prediction quality and the evolution of MSE by number of trees look broadly similar to the results of the runs by locale-month. Notably, month scores high on variable importance:

impList <- lapply(lstFullData_002, FUN=function(x) { 
    locName <- x[[1]]$testData$locNamefct %>% as.character() %>% unique()
    x[[1]]$rfModel$importance %>% 
        as.data.frame() %>%
        rownames_to_column("variable") %>%
        rename_at(vars(all_of("IncNodePurity")), ~locName) %>%
        tibble::as_tibble()
    }
    )

impDF <- Reduce(function(x, y) merge(x, y, all=TRUE), impList)

# Overall variable importance
impDF %>%
    pivot_longer(-variable, names_to="locale", values_to="incPurity") %>%
    ggplot(aes(x=fct_reorder(varMapper[variable], incPurity), y=incPurity)) + 
    geom_col() + 
    coord_flip() + 
    facet_wrap(~locale) + 
    labs(x="", y="Importance", title="Variable Importance by Locale")

# Relative variable importance
impDF %>%
    pivot_longer(-variable, names_to="locale", values_to="incPurity") %>%
    group_by(locale) %>%
    mutate(incPurity=incPurity/sum(incPurity)) %>%
    ggplot(aes(x=fct_reorder(varMapper[variable], incPurity), y=incPurity)) + 
    geom_col() + 
    coord_flip() + 
    facet_wrap(~locale) + 
    labs(x="", y="Relative Importance", title="Relative Variable Importance by Locale")

There is much more underlying variance in the Chicago data, thus greater overall variable importance in Chicago. On a relative basis, locale predictions are driven by:

  • Chicago - Dew Point, Month
  • Las Vegas - Month, Sea-Level Pressure, Hour, Altimeter
  • New Orleans - Dew Point, Month
  • San Diego - Month, Hour, Dew Point

It is interesting to see the similarities in Chicago and New Orleans, with both having strong explanatory power from the combination of dew point and month, despite meaningfully different climates. As in previous analyses, Las Vegas and San Diego look different from each other and also different from Chicago/New Orleans.

XGB regression of temperatures in select locales for 2014-2019

Next, the xgboost (extreme gradient boosting) algorithm is attempted on the METAR dataset. The general recipe from CRAN is followed, which includes several processing steps:

  1. Convert factor variable(s) to binary variables using one-hot-encoding without intercept
  2. Capture the target output variable as a vector
  3. Train the model using xgboost::xgboost (can handle regression or classification)
  4. Check feature importances
  5. Plot the evolution in training RMSE and R-squared
  6. Assess accuracy on test dataset
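
Steps 1 and 2 of the recipe can be sketched on a small made-up data frame (illustrative only; not the METAR data):

```r
# Toy data frame with one factor and one numeric predictor (made-up values)
toy <- data.frame(y=c(10, 20, 30, 40),
                  f=factor(c("a", "b", "a", "b")),
                  z=c(1.5, 2.0, 2.5, 3.0))

# Step 1: one-hot encode without an intercept; the factor expands to one
# column per level because no intercept column absorbs a baseline level
sparseX <- Matrix::sparse.model.matrix(y ~ . - 1, data=toy)
colnames(sparseX)   # "fa" "fb" "z"

# Step 2: capture the target output variable as a plain numeric vector
yVec <- toy$y
```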

A very basic xgb model is first attempted for predicting temperature. The data are prepared:

# Take metarData and limit to 4 sources with 2014-2019 data
baseXGBData_big4 <- metarData %>%
    filter(locNamefct %in% c("Las Vegas, NV", "New Orleans, LA", "Chicago, IL", "San Diego, CA"), 
           !is.na(TempF)
           )

# Split into test and train datasets
idxTrain_big4 <- sample(1:nrow(baseXGBData_big4), size=round(0.7*nrow(baseXGBData_big4)), replace=FALSE)
baseXGBTrain_big4 <- baseXGBData_big4[idxTrain_big4, ]
baseXGBTest_big4 <- baseXGBData_big4[-idxTrain_big4, ]

# Select only variables of interest
xgbTrainInput_big4 <- baseXGBTrain_big4 %>%
    select(TempF, 
           locNamefct, month, hrfct, 
           DewF, modSLP, Altimeter, WindSpeed, 
           predomDir, minHeight, ceilingHeight
           ) %>%
    mutate(locNamefct=fct_drop(locNamefct))

Then, the three modeling steps are run:

# Step 1: Convert to sparse matrix format using one-hot encoding with no intercept
xgbTrainSparse_big4 <- Matrix::sparse.model.matrix(TempF ~ . - 1, data=xgbTrainInput_big4)
str(xgbTrainSparse_big4)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:1404725] 1 3 12 15 18 21 22 26 37 47 ...
##   ..@ p       : int [1:61] 0 36584 73243 109800 146521 157810 170204 182182 194663 206744 ...
##   ..@ Dim     : int [1:2] 146521 60
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:146521] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:60] "locNamefctChicago, IL" "locNamefctLas Vegas, NV" "locNamefctNew Orleans, LA" "locNamefctSan Diego, CA" ...
##   ..@ x       : num [1:1404725] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ factors : list()
xgbTrainSparse_big4[1:6, ]
## 6 x 60 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 60 column names 'locNamefctChicago, IL', 'locNamefctLas Vegas, NV', 'locNamefctNew Orleans, LA' ... ]]
##                                                                              
## 1 . 1 . . . . 1 . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . .
## 2 1 . . . . . . 1 . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . .
## 3 . . . 1 . . . 1 . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . .
## 4 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
## 5 . 1 . . . . 1 . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . .
## 6 . . 1 . . . . . . . . . . 1 . . . . 1 . . . . . . . . . . . . . . . . . . .
##                                                            
## 1 17.06 1007.2 29.80  8 . . . . . . 1 . . . . . . 1 . . . 1
## 2 48.92 1011.7 29.89 16 . . . . . . 1 . . . . . . 1 . . . 1
## 3 55.94 1012.6 29.90  7 . . . . . . . . 1 . 1 . . . . . . 1
## 4 26.96 1016.9 30.01  8 . . . . . . . . 1 . 1 . . . . . 1 .
## 5  5.00 1016.0 30.05  6 . . . . . 1 . . . . . . . 1 . . . 1
## 6 46.04 1019.4 30.10  9 . . . . 1 . . . . . . . . 1 . . . 1
# Step 2: Create the target output variable as a vector
xgbTrainOutput_big4 <- xgbTrainInput_big4$TempF
str(xgbTrainOutput_big4)
##  num [1:146521] 66.9 75 66.9 33.1 73.9 ...
# Step 3: Train the model using xgboost::xgboost, as regression
xgbModel_big4 <- xgboost::xgboost(data=xgbTrainSparse_big4, 
                                  label=xgbTrainOutput_big4, 
                                  nrounds=200, 
                                  print_every_n=20, 
                                  objective="reg:squarederror"
                                  )
## [1]  train-rmse:47.123901 
## [21] train-rmse:5.213187 
## [41] train-rmse:4.517478 
## [61] train-rmse:4.131727 
## [81] train-rmse:3.929088 
## [101]    train-rmse:3.758572 
## [121]    train-rmse:3.622754 
## [141]    train-rmse:3.511173 
## [161]    train-rmse:3.426437 
## [181]    train-rmse:3.357111 
## [200]    train-rmse:3.290211

Then, the three assessment steps are run:

# Step 4: Assess feature importances
xgbImportance_big4 <- xgboost::xgb.importance(feature_names=colnames(xgbTrainSparse_big4), 
                                              model=xgbModel_big4
                                              )

xgbImportance_big4 %>% 
    column_to_rownames("Feature") %>% 
    round(3)
##                            Gain Cover Frequency
## DewF                      0.417 0.068     0.172
## locNamefctChicago, IL     0.203 0.011     0.025
## Altimeter                 0.084 0.107     0.107
## modSLP                    0.072 0.120     0.146
## ceilingHeightNone         0.037 0.010     0.035
## locNamefctLas Vegas, NV   0.020 0.029     0.041
## monthJul                  0.018 0.011     0.009
## monthJun                  0.015 0.011     0.012
## monthAug                  0.015 0.009     0.008
## monthSep                  0.010 0.012     0.009
## monthDec                  0.008 0.023     0.011
## WindSpeed                 0.007 0.018     0.086
## monthFeb                  0.007 0.025     0.009
## minHeightMedium           0.006 0.006     0.021
## locNamefctSan Diego, CA   0.005 0.008     0.021
## monthMay                  0.005 0.020     0.013
## minHeightNone             0.005 0.015     0.022
## monthOct                  0.005 0.017     0.011
## hrfct21                   0.004 0.014     0.004
## hrfct20                   0.004 0.012     0.005
## minHeightHigh             0.003 0.010     0.018
## locNamefctNew Orleans, LA 0.003 0.007     0.014
## hrfct13                   0.003 0.019     0.003
## hrfct12                   0.003 0.020     0.004
## hrfct11                   0.003 0.020     0.004
## monthApr                  0.003 0.024     0.012
## hrfct22                   0.003 0.013     0.003
## hrfct19                   0.003 0.010     0.004
## hrfct10                   0.003 0.021     0.003
## hrfct23                   0.003 0.009     0.003
## hrfct9                    0.002 0.020     0.003
## minHeightLow              0.002 0.004     0.014
## hrfct18                   0.002 0.015     0.004
## hrfct14                   0.002 0.020     0.004
## monthMar                  0.002 0.022     0.008
## ceilingHeightHigh         0.002 0.004     0.008
## monthNov                  0.001 0.018     0.009
## hrfct8                    0.001 0.022     0.004
## hrfct15                   0.001 0.017     0.005
## predomDirVRB              0.001 0.007     0.003
## hrfct7                    0.001 0.020     0.003
## hrfct17                   0.001 0.012     0.005
## hrfct1                    0.001 0.009     0.004
## hrfct16                   0.001 0.013     0.006
## hrfct6                    0.001 0.020     0.003
## predomDirS                0.001 0.001     0.009
## hrfct2                    0.000 0.009     0.004
## predomDirNE               0.000 0.003     0.008
## ceilingHeightMedium       0.000 0.008     0.005
## hrfct5                    0.000 0.019     0.003
## predomDirSW               0.000 0.001     0.008
## ceilingHeightLow          0.000 0.003     0.006
## predomDirN                0.000 0.002     0.009
## hrfct3                    0.000 0.009     0.005
## predomDirNW               0.000 0.006     0.007
## predomDirW                0.000 0.002     0.008
## predomDirE                0.000 0.002     0.007
## hrfct4                    0.000 0.016     0.003
## predomDirSE               0.000 0.000     0.005
xgbImportance_big4 %>%
    ggplot(aes(x=fct_reorder(Feature, Gain), y=Gain)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(y=Gain+0.02, label=round(Gain, 3))) + 
    coord_flip() + 
    labs(x="", title="Gain by Variable for TempF modeling with xgboost")

# Step 5: Plot evolution in training data RMSE and R-squared
xgbModel_big4$evaluation_log %>%
    filter(iter %% 5 == 0) %>%
    ggplot(aes(x=iter, y=train_rmse)) + 
    geom_text(aes(label=round(train_rmse, 1)), size=3) + 
    labs(x="Number of iterations", y="Training Set RMSE", title="Evolution of RMSE on training data")

xgbModel_big4$evaluation_log %>%
    filter(iter %% 10 == 0) %>%
    mutate(overall_rmse=sd(baseXGBTrain_big4$TempF), rsq=1-train_rmse**2/overall_rmse**2) %>%
    ggplot(aes(x=iter, y=rsq)) + 
    geom_text(aes(y=rsq, label=round(rsq,3)), size=3) +
    labs(x="Number of iterations", y="Training Set R-squared", title="Evolution of R-squared on training data")

# Step 6: Assess accuracy on test dataset
xgbTestInput_big4 <- baseXGBTest_big4 %>%
    select(TempF, 
           locNamefct, month, hrfct, 
           DewF, modSLP, Altimeter, WindSpeed, 
           predomDir, minHeight, ceilingHeight
           ) %>%
    mutate(locNamefct=fct_drop(locNamefct))

xgbTestSparse_big4 <- Matrix::sparse.model.matrix(TempF ~ . - 1, data=xgbTestInput_big4)

xgbTest_big4 <- xgbTestInput_big4 %>%
    mutate(xgbPred=predict(xgbModel_big4, newdata=xgbTestSparse_big4), err=xgbPred-TempF)

xgbTest_big4 %>%
    group_by(locNamefct) %>%
    summarize(rmse_orig=sd(TempF), rmse_xgb=mean(err**2)**0.5) %>%
    mutate(rsq=1-rmse_xgb**2/rmse_orig**2)
## # A tibble: 4 x 4
##   locNamefct      rmse_orig rmse_xgb   rsq
##   <fct>               <dbl>    <dbl> <dbl>
## 1 Chicago, IL         21.2      3.91 0.966
## 2 Las Vegas, NV       18.3      3.96 0.953
## 3 New Orleans, LA     13.4      3.39 0.936
## 4 San Diego, CA        7.38     3.13 0.820
xgbTest_big4 %>%
    group_by(month) %>%
    summarize(rmse_orig=sd(TempF), rmse_xgb=mean(err**2)**0.5) %>%
    mutate(rsq=1-rmse_xgb**2/rmse_orig**2)
## # A tibble: 12 x 4
##    month rmse_orig rmse_xgb   rsq
##    <fct>     <dbl>    <dbl> <dbl>
##  1 Jan        17.4     3.74 0.954
##  2 Feb        17.6     3.65 0.957
##  3 Mar        14.9     3.76 0.936
##  4 Apr        12.5     4.06 0.894
##  5 May        11.5     3.72 0.895
##  6 Jun        12.3     3.42 0.923
##  7 Jul        11.2     3.22 0.918
##  8 Aug        10.1     3.09 0.907
##  9 Sep        10.1     3.68 0.866
## 10 Oct        11.7     3.76 0.897
## 11 Nov        14.1     3.81 0.927
## 12 Dec        14.2     3.36 0.944
xgbTest_big4 %>%
    group_by(TempF, rndPred=round(xgbPred)) %>%
    summarize(n=n()) %>%
    ggplot(aes(x=TempF, y=rndPred)) + 
    geom_point(aes(size=n), alpha=0.1) + 
    geom_smooth(aes(weight=n)) + 
    geom_abline(lty=2, color="red") + 
    labs(title="XGB predictions vs. actual on test dataset", y="Predicted Temperature", x="Actual Temperature")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

At a glance, initial prediction results are encouraging. The model runs very quickly and reaches RMSE/R-squared on test data comparable to the random forest. Tuning parameters or adding cross-validation could potentially improve the algorithm further.
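
As a sketch of the cross-validation suggestion, xgboost::xgb.cv can reuse the sparse matrix and label vector built above; the nfold and early_stopping_rounds values here are illustrative assumptions rather than tuned choices:

```r
# Sketch only: 5-fold cross-validation on the training data built above;
# early stopping halts boosting once held-out RMSE stops improving
cvModel_big4 <- xgboost::xgb.cv(data=xgbTrainSparse_big4,
                                label=xgbTrainOutput_big4,
                                nrounds=200,
                                nfold=5,
                                objective="reg:squarederror",
                                early_stopping_rounds=10,
                                print_every_n=20
                                )

# Held-out RMSE by boosting round; the best iteration suggests where to stop
cvModel_big4$evaluation_log %>%
    ggplot(aes(x=iter, y=test_rmse_mean)) +
    geom_line() +
    labs(x="Iteration", y="CV Test RMSE", title="Cross-validated RMSE by boosting round")
```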

An initial conversion to functional form is made, leveraging some of the code already available in rfMultiLocale() and rfTwoLocales():

# Helper function to create a sparse matrix without intercept (keeping all factor levels)
# Uses contrasts.arg to keep all levels of all factors, similar to caret::dummyVars
helperMakeSparse <- function(tbl, depVar, predVars) {
    
    # FUNCTION ARGUMENTS
    # tbl: the tibble or data frame to be converted
    # depVar: the dependent variable (not to be included in the sparse matrix)
    # predVars: the predictor variables to be converted to sparse format
    
    # Filter to include only predVars, then make the sparse model-matrix object
    # Include all contrast levels for every factor variable and exclude the intercept
    tbl %>%
        select_at(vars(all_of(c(predVars)))) %>%
        Matrix::sparse.model.matrix(~ . -1, 
                                    data=., 
                                    contrasts.arg=lapply(.[, sapply(., is.factor)], contrasts, contrasts=FALSE)
                                    )
    
}
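
As a quick sanity check on made-up data, the contrasts.arg call is what preserves every level of every factor; the default would keep all levels only for the first factor encountered:

```r
# Made-up data: two factors and one numeric predictor
chk <- data.frame(y=1:4,
                  f=factor(c("a", "b", "a", "b")),
                  g=factor(c("x", "x", "y", "y")),
                  z=c(1.5, 2.0, 2.5, 3.0))

# Default behavior: the second factor g is reduced to contrasts (one column)
ncol(Matrix::sparse.model.matrix(y ~ . - 1, data=chk))              # 4 columns

# helperMakeSparse keeps all levels of both factors (fa, fb, gx, gy, z)
ncol(helperMakeSparse(chk, depVar="y", predVars=c("f", "g", "z")))  # 5 columns
```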


# Run xgb model with desired parameters
xgbRunModel <- function(tbl, 
                        depVar, 
                        predVars,
                        otherVars=c("source", "dtime"),
                        critFilter=vector("list", 0),
                        dropEmptyLevels=TRUE,
                        seed=NULL, 
                        nrounds=200, 
                        print_every_n=nrounds, 
                        testSize=0.3, 
                        xgbObjective="reg:squarederror",
                        funcRun=xgboost::xgboost,
                        calcErr=TRUE,
                        ...
                        ) {
    
    # FUNCTION ARGUMENTS:
    # tbl: the data frame or tibble
    # depVar: the dependent variable that will be predicted
    # predVars: explanatory variables for modeling
    # otherVars: other variables to be kept in a final testData file, but not used in modeling
    # critFilter: named list of format list(varName=c(varValues))
    #             will include only observations where get(varName) %in% varValues
    #             vector("list", 0) creates a length-zero list, which never runs in the for loop
    # dropEmptyLevels: boolean, whether to run fct_drop on all variables of class factor after critFilter
    # seed: the random seed (NULL means no seed)
    # nrounds: the maximum number of boosting iterations
    # print_every_n: how frequently to print the progress of training error/accuracy while fitting
    # testSize: the fractional portion of data that should be used as the test dataset
    # xgbObjective: the objective function for xgboost
    # funcRun: the function to run, passed as a function
    # calcErr: boolean, whether to create variable err as predicted-get(depVar)
    # ...: additional arguments to be passed directly to xgboost
    
    # Check that funcName is valid and get the relevant function
    valFuncs <- c("xgboost", "xgb.cv")
    funcName <- as.character(substitute(funcRun))
    if (!(funcName[length(funcName)] %in% valFuncs)) {
        cat("\nFunction is currently only prepared for:", valFuncs, "\n")
        stop("Please change passed argument or update function\n")
    }
    
    # Filter such that only matches to critFilter are included
    for (xNum in seq_len(length(critFilter))) {
        tbl <- tbl %>%
            filter_at(vars(all_of(names(critFilter)[xNum])), ~. %in% critFilter[[xNum]])
    }
    
    # Keep only the depVar, predVar, and otherVars
    tbl <- tbl %>%
        select_at(vars(all_of(c(depVar, predVars, otherVars))))
    
    # Drop empty levels from factors if requested
    if (dropEmptyLevels) {
        tbl <- tbl %>%
            mutate_if(is.factor, .funs=fct_drop)
    }
    
    # Create test-train split
    ttLists <- createTestTrain(tbl, testSize=testSize, seed=seed)
    
    # Set the seed if requested
    if (!is.null(seed)) { set.seed(seed) }
    
    # Pull the dependent variable
    yTrain <- ttLists$trainData[, depVar, drop=TRUE]
    
    # Convert the dependent variable to be integers 0 to (n-1) if xgbObjective is "multi:.*"
    if (str_detect(xgbObjective, pattern="^multi:")) {
        # Convert to factor if passed as anything else
        if (!is.factor(yTrain)) yTrain <- factor(yTrain)
        # Save the factor levels so they can be added back later
        yTrainLevels <- levels(yTrain)
        # Convert to numeric 0 to n-1
        yTrain <- as.integer(yTrain) - 1
    } else {
        yTrainLevels <- NULL
    }
    
    # Convert predictor variables to sparse matrix format keeping only modeling variables
    sparseTrain <- helperMakeSparse(ttLists$trainData, depVar=depVar, predVars=predVars)
    sparseTest <- helperMakeSparse(ttLists$testData, depVar=depVar, predVars=predVars)
    
    # Train model
    xgbModel <- funcRun(data=sparseTrain, 
                        label=yTrain, 
                        nrounds=nrounds, 
                        print_every_n=print_every_n, 
                        objective=xgbObjective, 
                        ...
                        )

    # Initialize testData and predData to NULL (they remain NULL for xgb.cv)
    testData <- NULL
    predData <- NULL
    
    # Extract testData and add predictions if the model passed is xgboost
    if (funcName[length(funcName)] %in% c("xgboost")) {
        if (xgbObjective=="multi:softprob") {
            predData <- matrix(data=predict(xgbModel, newdata=sparseTest), 
                               nrow=nrow(ttLists$testData), 
                               ncol=length(yTrainLevels), 
                               byrow=TRUE
                               )
            maxCol <- apply(predData, 1, FUN=which.max)
            testData <- ttLists$testData %>%
                mutate(predicted=yTrainLevels[maxCol], probPredicted=apply(predData, 1, FUN=max))
            predData <- predData %>%
                as_tibble() %>%
                purrr::set_names(yTrainLevels)
        } else {
            testData <- ttLists$testData %>%
                mutate(predicted=predict(xgbModel, newdata=sparseTest))
            if (calcErr) { 
                testData <- testData %>% mutate(err=predicted-get(depVar))
            }
        }
    }
    
    # Return list containing funcName, trained model, and testData
    list(funcName=funcName[length(funcName)], 
         xgbModel=xgbModel, 
         testData=testData, 
         predData=predData,
         yTrainLevels=yTrainLevels
         )
    
}
# Define key predictor variables for base XGB runs
baseXGBPreds <- c("locNamefct", "month", "hrfct", 
                  "DewF", "modSLP", "Altimeter", "WindSpeed", 
                  "predomDir", "minHeight", "ceilingHeight"
                  )

# Core multi-year cities
multiYearLocales <- c("Las Vegas, NV", "New Orleans, LA", "Chicago, IL", "San Diego, CA")

# Run the function shell
xgbInit <- xgbRunModel(filter(metarData, !is.na(TempF)), 
                       depVar="TempF", 
                       predVars=baseXGBPreds, 
                       otherVars=c("source", "dtime"), 
                       critFilter=list(locNamefct=multiYearLocales),
                       seed=2008011825,
                       nrounds=2000,
                       print_every_n=50
                       )
## [1]  train-rmse:47.096767 
## [51] train-rmse:4.079394 
## [101]    train-rmse:3.584855 
## [151]    train-rmse:3.299586 
## [201]    train-rmse:3.106099 
## [251]    train-rmse:2.970594 
## [301]    train-rmse:2.866863 
## [351]    train-rmse:2.776774 
## [401]    train-rmse:2.697801 
## [451]    train-rmse:2.626064 
## [501]    train-rmse:2.572620 
## [551]    train-rmse:2.518776 
## [601]    train-rmse:2.469624 
## [651]    train-rmse:2.426183 
## [701]    train-rmse:2.390440 
## [751]    train-rmse:2.350134 
## [801]    train-rmse:2.315014 
## [851]    train-rmse:2.276004 
## [901]    train-rmse:2.244308 
## [951]    train-rmse:2.213070 
## [1001]   train-rmse:2.184253 
## [1051]   train-rmse:2.156494 
## [1101]   train-rmse:2.133281 
## [1151]   train-rmse:2.104960 
## [1201]   train-rmse:2.079930 
## [1251]   train-rmse:2.053286 
## [1301]   train-rmse:2.030751 
## [1351]   train-rmse:2.010247 
## [1401]   train-rmse:1.989779 
## [1451]   train-rmse:1.966085 
## [1501]   train-rmse:1.944863 
## [1551]   train-rmse:1.920303 
## [1601]   train-rmse:1.897431 
## [1651]   train-rmse:1.877888 
## [1701]   train-rmse:1.860289 
## [1751]   train-rmse:1.836947 
## [1801]   train-rmse:1.817471 
## [1851]   train-rmse:1.800320 
## [1901]   train-rmse:1.783760 
## [1951]   train-rmse:1.767166 
## [2000]   train-rmse:1.754318

Functions can then be written to assess the quality of the predictions:

# Create and plot importance for XGB model
plotXGBImportance <- function(mdl, 
                              subList="xgbModel", 
                              featureStems=NULL, 
                              showMainPlot=TRUE, 
                              showStemPlot=!is.null(featureStems), 
                              stemMapper=NULL,
                              plotTitle="Gain by Variable for xgboost", 
                              plotSubtitle=NULL
                              ) {
    
    # FUNCTION ARGUMENTS:
    # mdl: the xgb.Booster model file, or a list containing the xgb.Booster model file
    # subList: if mdl is a list, attempt to pull out item named in subList
    # featureStems: aggregate features starting with this vector of stems, and plot sum of gain 
    #               (NULL means do not do this and just plot the gains "as is" )
    # showMainPlot: boolean, whether to create the full importance plot (just return importance data otherwise)
    # showStemPlot: boolean, whether to create the plot summed by stems
    # stemMapper: mapping file to convert stem variables to descriptive names (NULL means leave as-is)
    # plotTitle: title to be included on the importance plots
    # plotSubtitle: subtitle to be included on the importance plots (NULL means no subtitle)

    # Pull out the modeling data from the list if needed
    if (!("xgb.Booster" %in% class(mdl))) {
        mdl <- mdl[[subList]]
    }
    
    # Pull out the feature importances
    xgbImportance <- xgboost::xgb.importance(model=mdl)
    
    # Helper function to sum data to stem (called below if featureStems is not NULL)
    helperStemTotal <- function(pattern, baseData=xgbImportance) {
        baseData %>%
            filter(grepl(pattern=paste0("^", pattern), x=Feature)) %>%
            select_if(is.numeric) %>%
            colSums()
    }
    
    # Create sums by stem if requested
    if (!is.null(featureStems)) {
        stemTotals <- sapply(featureStems, FUN=function(x) { helperStemTotal(pattern=x) } ) %>%
            t() %>%
            as.data.frame() %>%
            rownames_to_column("Feature") %>%
            tibble::as_tibble()
        print(stemTotals)
    } else {
        stemTotals <- NULL
    }

    # Helper function to plot gain by Feature
    helperPlotGain <- function(df, title, subtitle, mapper=NULL, caption=NULL) {
        # Add descriptive name if mapper is passed
        if (!is.null(mapper)) df <- df %>% mutate(Feature=paste0(Feature, "\n", mapper[Feature]))
        p1 <- df %>%
            ggplot(aes(x=fct_reorder(Feature, Gain), y=Gain)) + 
            geom_col(fill="lightblue") + 
            geom_text(aes(y=Gain+0.02, label=round(Gain, 3))) + 
            coord_flip() + 
            labs(x="", title=title)
        if (!is.null(subtitle)) { p1 <- p1 + labs(subtitle=subtitle) }
        if (!is.null(caption)) { p1 <- p1 + labs(caption=caption) }
        print(p1)
    }
    
    # Create and display the plots if requested
    if (showMainPlot) helperPlotGain(xgbImportance, title=plotTitle, subtitle=plotSubtitle)
    if (showStemPlot) { 
        helperPlotGain(stemTotals, 
                       title=plotTitle, 
                       subtitle=plotSubtitle,
                       mapper=stemMapper,
                       caption="Factor variables summed by stem"
                       )
    }
    
    # Return a list containing the raw data and the stemmed data (can be NULL)
    list(importanceData=xgbImportance, stemData=stemTotals)
    
}

# Find and plot importances
xgbInit_importance <- plotXGBImportance(xgbInit, 
                                        featureStems=baseXGBPreds, 
                                        stemMapper = varMapper, 
                                        plotTitle="Gain by variable in xgboost", 
                                        plotSubtitle="Modeling Temperature (F) in 4 Locales 2014-2019"
                                        )
## # A tibble: 10 x 4
##    Feature          Gain  Cover Frequency
##    <chr>           <dbl>  <dbl>     <dbl>
##  1 locNamefct    0.241   0.0696    0.0496
##  2 month         0.123   0.0810    0.122 
##  3 hrfct         0.0460  0.104     0.159 
##  4 DewF          0.387   0.134     0.166 
##  5 modSLP        0.0682  0.288     0.158 
##  6 Altimeter     0.0724  0.197     0.0815
##  7 WindSpeed     0.00745 0.0446    0.100 
##  8 predomDir     0.00468 0.0375    0.0778
##  9 minHeight     0.0168  0.0250    0.0505
## 10 ceilingHeight 0.0332  0.0202    0.0350

# Function to create and plot RMSE and R-squared evolution of training data
plotXGBTrainEvolution <- function(mdl, 
                                  fullSD,
                                  subList="xgbModel", 
                                  plotRMSE=TRUE,
                                  plotR2=TRUE, 
                                  plot_every=10
                                  ) {
    
    # FUNCTION ARGUMENTS:
    # mdl: the xgb.Booster model file, or a list containing the xgb.Booster model file
    # fullSD: the overall standard deviation for the training variable
    #         if passed as numeric, use as-is
    #         if passed as character, try to extract sd of that variable from 'testData' in mdl
    # subList: if mdl is a list, attempt to pull out item named in subList
    # plotRMSE: boolean, whether to create the RMSE evolution plot
    # plotR2: boolean, whether to create the R2 evolution plot
    # plot_every: how often to plot the RMSE/R-squared data (e.g., 10 means print iter 10, 20, 30, etc.)

    # Create the full standard deviation from 'testData' if passed as character
    # Must be run before mdl is converted out of list format
    if (is.character(fullSD)) {
        fullSD <- mdl[["testData"]] %>% pull(fullSD) %>% sd()
    }
    
    # Pull out the modeling data from the list if needed
    if (!("xgb.Booster" %in% class(mdl))) {
        mdl <- mdl[[subList]]
    }
    
    # Extract the evaluation log and add an approximated R-squared
    rmseR2 <- mdl[["evaluation_log"]] %>%
        mutate(overall_rmse=fullSD, rsq=1-train_rmse**2/overall_rmse**2) %>%
        tibble::as_tibble()
    
    # Helper function to create requested plot(s)
    helperPlotEvolution <- function(df, yVar, rnd, desc, size=3, plot_every=1) {
        p1 <- df %>%
            filter((iter %% plot_every) == 0) %>%
            ggplot(aes_string(x="iter", y=yVar)) + 
            geom_text(aes(label=round(get(yVar), rnd)), size=size) + 
            labs(x="Number of iterations", 
                 y=paste0("Training Set ", desc), 
                 title=paste0("Evolution of ", desc, " on training data")
                 )
        print(p1)
    }
    
    # Create the RMSE and R-squared plots if requested
    if (plotRMSE) helperPlotEvolution(rmseR2, yVar="train_rmse", rnd=1, desc="RMSE", plot_every=plot_every)
    if (plotR2) helperPlotEvolution(rmseR2, yVar="rsq", rnd=3, desc="R-squared", plot_every=plot_every)
    
    # Return the full evolution data frame
    rmseR2
    
}

# Plot the evolution
xgbInitEvolution <- plotXGBTrainEvolution(xgbInit, fullSD="TempF", plot_every=50)

# Function to report on, and plot, prediction caliber on testData
plotXGBTestData <- function(mdl, 
                            depVar,
                            predVar="predicted",
                            subList="testData", 
                            reportOverall=TRUE,
                            reportBy=NULL, 
                            showPlot=TRUE
                            ) {
    
    # FUNCTION ARGUMENTS:
    # mdl: the test data file, or a list containing the test data file
    # depVar: the variable that was predicted
    # predVar: the variable containing the prediction for depVar
    # subList: if mdl is a list, attempt to pull out item named in subList
    # reportOverall: boolean, whether to report an overall RMSE/R2 on test data
    # reportBy: variable for summarizing RMSE/R2 by (NULL means no RMSE/R2 by any grouping variables)
    # showPlot: boolean, whether to create/show the plot of predictions vs actuals
    
    # Pull out the modeling data from the list if needed
    if ("list" %in% class(mdl)) {
        mdl <- mdl[[subList]]
    }
    
    # Helper function to print RMSE and R2
    helperReportRMSER2 <- function(df, depVar, predVar) {
        df %>%
            summarize(rmse_orig=sd(get(depVar)), 
                      rmse_xgb=mean((get(predVar)-get(depVar))**2)**0.5
                      ) %>%
            mutate(rsq=1-rmse_xgb**2/rmse_orig**2) %>%
            print()
        cat("\n")
    }
    
    # Report overall RMSE/R2 if requested
    if (reportOverall) {
        cat("\nOVERALL PREDICTIVE PERFORMANCE:\n\n")
        helperReportRMSER2(mdl, depVar=depVar, predVar=predVar)
        cat("\n")
    }
    
    # Report by grouping variables if any
    if (!is.null(reportBy)) {
        cat("\nPREDICTIVE PERFORMANCE BY GROUP(S):\n\n")
        sapply(reportBy, FUN=function(x) { 
            mdl %>% group_by_at(x) %>% helperReportRMSER2(depVar=depVar, predVar=predVar)
        }
        )
        cat("\n")
    }

    # Show overall model performance using rounded temperature and predictions
    if (showPlot) {
        p1 <- mdl %>%
            mutate(rndPred=round(get(predVar))) %>%
            group_by_at(vars(all_of(c(depVar, "rndPred")))) %>%
            summarize(n=n()) %>%
            ggplot(aes_string(x=depVar, y="rndPred")) + 
            geom_point(aes(size=n), alpha=0.1) + 
            geom_smooth(aes(weight=n)) + 
            geom_abline(lty=2, color="red") + 
            labs(title="XGB predictions vs. actual on test dataset", y="Predicted", x="Actual")
        print(p1)
    }
    
}

# Assess performance on test data
plotXGBTestData(xgbInit, depVar="TempF", reportBy=c("locNamefct", "month"))
## 
## OVERALL PREDICTIVE PERFORMANCE:
## 
## # A tibble: 1 x 3
##   rmse_orig rmse_xgb   rsq
##       <dbl>    <dbl> <dbl>
## 1      18.0     2.98 0.972
## 
## 
## 
## PREDICTIVE PERFORMANCE BY GROUP(S):
## 
## # A tibble: 4 x 4
##   locNamefct      rmse_orig rmse_xgb   rsq
##   <fct>               <dbl>    <dbl> <dbl>
## 1 Chicago, IL         21.1      3.24 0.976
## 2 Las Vegas, NV       18.3      2.75 0.977
## 3 New Orleans, LA     13.3      3.03 0.948
## 4 San Diego, CA        7.37     2.89 0.846
## 
## # A tibble: 12 x 4
##    month rmse_orig rmse_xgb   rsq
##    <fct>     <dbl>    <dbl> <dbl>
##  1 Jan        17.0     2.90 0.971
##  2 Feb        17.6     3.07 0.970
##  3 Mar        14.9     3.08 0.957
##  4 Apr        12.4     3.22 0.932
##  5 May        11.6     3.08 0.929
##  6 Jun        12.2     2.82 0.946
##  7 Jul        11.1     2.78 0.937
##  8 Aug        10.1     2.67 0.930
##  9 Sep        10.3     2.90 0.920
## 10 Oct        11.8     3.15 0.929
## 11 Nov        14.1     3.18 0.949
## 12 Dec        14.2     2.88 0.959
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The XGB model appears to be over-fitting the training data, as reflected by the RMSE gap between the test data set and the training data set:

# Get the random forest by locale RMSE reported by the training process
rfByLocaleRMSE <- sapply(lstFullData_002, FUN=function(x) { x[[1]][["rfModel"]][["mse"]] }) %>%
    as.data.frame() %>%
    tibble::as_tibble() %>%
    mutate(train_rmse=apply(., 1, FUN=mean)**0.5)

# Get the random forest RMSE reported by the test data
rfTestRMSE <- fullTestData_002 %>% 
    summarize(rfTestrmse=mean(err**2)**0.5) %>% 
    pull(rfTestrmse)

# Get the XGB RMSE reported by the training process
xgbInitRMSE <- xgbInit$xgbModel$evaluation_log

# Get the XGB RMSE reported by the test data
xgbTestRMSE <- xgbInit$testData %>%
    summarize(xgbTestrmse=mean(err**2)**0.5) %>%
    pull(xgbTestrmse)

# Plot differences in train/test RMSE
tibble::tibble(model=c("Random Forest", "XGB"), 
               train=c(tail(rfByLocaleRMSE$train_rmse, 1), tail(xgbInitRMSE$train_rmse, 1)), 
               test=c(rfTestRMSE, xgbTestRMSE)
               ) %>%
    pivot_longer(-model, names_to="type", values_to="RMSE") %>%
    mutate(type=factor(type, levels=c("train", "test"))) %>%
    ggplot(aes(x=model, y=RMSE, fill=type)) + 
    geom_col(position="dodge") + 
    geom_text(aes(y=RMSE+0.2, label=round(RMSE, 1)), position=position_dodge(width=1)) +
    labs(x="", title="XGB test RMSE is meaningfully higher than XGB train RMSE (overfitting)") + 
    scale_fill_discrete("") + 
    theme(legend.position="bottom")

So, the XGB model either needs to be tuned, or it needs to run for a much smaller number of iterations. The function xgb.cv allows for cross-validation during the training process, which means that a running count of train/test RMSE is kept by iteration. Suppose that this function is run:

# Create the same training dataset as would have been passed to the previous XGB modeling
xgbInput_big4 <- metarData %>%
    filter(!is.na(TempF), locNamefct %in% multiYearLocales) %>%
    anti_join(xgbInit$testData %>% select(source, dtime)) %>%
    select_at(vars(all_of(c("TempF", baseXGBPreds, "source", "dtime")))) %>%
    mutate_if(is.factor, .funs=fct_drop)
## Joining, by = c("source", "dtime")
# Step 1: Convert to sparse matrix format using one-hot encoding with no intercept
xgbTrain_big4 <- helperMakeSparse(xgbInput_big4, predVars=baseXGBPreds)

# Step 2: Create the target output variable as a vector
xgbTarget_big4 <- xgbInput_big4$TempF

# Step 3: Train the model using xgboost::xgb.cv, as regression
xgbInit_cv <- xgboost::xgb.cv(data=xgbTrain_big4, 
                              label=xgbTarget_big4, 
                              nrounds=2000, 
                              nfold=5,
                              print_every_n=50, 
                              objective="reg:squarederror"
                              )
## [1]  train-rmse:47.098820+0.017976   test-rmse:47.099904+0.076497 
## [51] train-rmse:4.051789+0.021049    test-rmse:4.223803+0.019306 
## [101]    train-rmse:3.539190+0.006330    test-rmse:3.804860+0.016793 
## [151]    train-rmse:3.247462+0.010342    test-rmse:3.581248+0.012934 
## [201]    train-rmse:3.058367+0.008329    test-rmse:3.451586+0.015910 
## [251]    train-rmse:2.916631+0.013438    test-rmse:3.363312+0.014480 
## [301]    train-rmse:2.805518+0.011073    test-rmse:3.299380+0.011600 
## [351]    train-rmse:2.706593+0.006588    test-rmse:3.247507+0.010449 
## [401]    train-rmse:2.625376+0.006445    test-rmse:3.211091+0.008976 
## [451]    train-rmse:2.550301+0.006425    test-rmse:3.181705+0.009290 
## [501]    train-rmse:2.486581+0.009604    test-rmse:3.158292+0.008441 
## [551]    train-rmse:2.429987+0.012186    test-rmse:3.138479+0.008697 
## [601]    train-rmse:2.380056+0.010852    test-rmse:3.124328+0.008938 
## [651]    train-rmse:2.333796+0.011301    test-rmse:3.111427+0.008789 
## [701]    train-rmse:2.290457+0.008911    test-rmse:3.100707+0.009145 
## [751]    train-rmse:2.251564+0.008006    test-rmse:3.093594+0.008037 
## [801]    train-rmse:2.212418+0.007689    test-rmse:3.084205+0.007674 
## [851]    train-rmse:2.176226+0.010150    test-rmse:3.078630+0.006701 
## [901]    train-rmse:2.141665+0.009509    test-rmse:3.073187+0.006579 
## [951]    train-rmse:2.108831+0.009814    test-rmse:3.068691+0.005981 
## [1001]   train-rmse:2.075038+0.008304    test-rmse:3.064253+0.006078 
## [1051]   train-rmse:2.043687+0.008306    test-rmse:3.060923+0.005548 
## [1101]   train-rmse:2.016390+0.008564    test-rmse:3.058380+0.005650 
## [1151]   train-rmse:1.989501+0.006228    test-rmse:3.056679+0.005847 
## [1201]   train-rmse:1.962299+0.006104    test-rmse:3.054455+0.005260 
## [1251]   train-rmse:1.936194+0.007773    test-rmse:3.052410+0.005896 
## [1301]   train-rmse:1.908768+0.007372    test-rmse:3.051539+0.006504 
## [1351]   train-rmse:1.883683+0.007715    test-rmse:3.051109+0.006276 
## [1401]   train-rmse:1.859494+0.006603    test-rmse:3.050993+0.007563 
## [1451]   train-rmse:1.835040+0.004953    test-rmse:3.050164+0.007402 
## [1501]   train-rmse:1.811114+0.003270    test-rmse:3.049802+0.007833 
## [1551]   train-rmse:1.789402+0.002534    test-rmse:3.049649+0.008335 
## [1601]   train-rmse:1.767315+0.001836    test-rmse:3.049585+0.008558 
## [1651]   train-rmse:1.746325+0.001873    test-rmse:3.049909+0.009003 
## [1701]   train-rmse:1.725838+0.002535    test-rmse:3.050378+0.009136 
## [1751]   train-rmse:1.704581+0.003683    test-rmse:3.050737+0.008637 
## [1801]   train-rmse:1.685065+0.004633    test-rmse:3.050562+0.008606 
## [1851]   train-rmse:1.664703+0.004838    test-rmse:3.051120+0.008569 
## [1901]   train-rmse:1.645825+0.004266    test-rmse:3.051313+0.008286 
## [1951]   train-rmse:1.626805+0.004438    test-rmse:3.052526+0.008683 
## [2000]   train-rmse:1.608829+0.005378    test-rmse:3.053323+0.008541

The evolution of RMSE can then be plotted:

xgbInit_cv$evaluation_log %>%
    select(iter, train=train_rmse_mean, test=test_rmse_mean) %>%
    pivot_longer(-iter, names_to="type", values_to="rmse") %>%
    filter(iter > 10) %>%
    ggplot(aes(x=iter, y=rmse, color=type, group=type)) + 
    geom_line() + 
    scale_color_discrete("") +
    labs(x="# Iterations", 
         y="RMSE of Temperature (F)", 
         title="XGB Model RMSE when trained using Cross Validation (5-fold)", 
         subtitle="plot excludes first 10 iterations"
         )

There is a significant divergence between train RMSE and test RMSE: test RMSE is mostly optimized after 500-1000 iterations, while train RMSE continues to improve almost linearly even beyond 1000 iterations. Unlike random forests, which control variance through bootstrap resampling and random feature selection at each split, the XGB model continues to fit residuals in the same dataset even after it has learned every pattern that generalizes to unseen data. Proper use of CV and early stopping will be important.
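As a sketch of how early stopping could be wired into the CV call, xgboost's early_stopping_rounds argument halts training once mean test RMSE stops improving. The example below uses a tiny synthetic regression problem (standing in for the sparse METAR design matrix, which is not rebuilt here):

```r
library(xgboost)

# Tiny synthetic regression problem standing in for the sparse METAR matrix
set.seed(2008011825)
X <- matrix(rnorm(500 * 5), ncol=5)
y <- as.vector(X %*% c(2, -1, 0.5, 0, 3) + rnorm(500))

# early_stopping_rounds halts training once mean test RMSE has not
# improved for that many consecutive rounds
cvRes <- xgboost::xgb.cv(data=X, 
                         label=y, 
                         nrounds=500, 
                         nfold=5, 
                         objective="reg:squarederror", 
                         early_stopping_rounds=25, 
                         verbose=0
                         )

# best_iteration is the round with the lowest mean test RMSE
cvRes$best_iteration
```

With the real xgbTrain_big4/xgbTarget_big4 inputs, the same call would pick the stopping point automatically instead of running a fixed 2000 rounds.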

Notably, XGB has achieved roughly 3.0 degrees RMSE with the test data while random forest achieved roughly 3.5 degrees with the test data. In part this is driven by the very fast optimization of XGB allowing for many more iterations than can be run in the same time for a random forest.

Potential next steps are to incorporate xgb.cv as an option in the training function, and to explore hyperparameters such as eta (learning rate) and maximum depth, and their impacts on the test set RMSE.

Function xgbRunModel has been updated to accept either xgboost::xgboost or xgboost::xgb.cv. This allows for running several versions of the hyperparameters through the CV process. For example:

# Create hyperparameter space
hypGrid <- expand.grid(eta=c(0.1, 0.3, 0.6), 
                       max_depth=c(3, 6, 10), 
                       nrounds=500, 
                       nfold=5
                       )

# Create containers for results
xgbSmall <- vector("list", nrow(hypGrid))

# Run xgb.cv once for each combination of hyper-parameters
for (rowNum in 1:nrow(hypGrid)) {

    # Extract the relevant hyperparameter data row
    params <- hypGrid[rowNum, ] %>% unlist()
    
    # Run the function for 5-fold CV for a few values of eta and max_depth
    xgbTemp <- xgbRunModel(filter(metarData, !is.na(TempF)), 
                           depVar="TempF", 
                           predVars=baseXGBPreds, 
                           otherVars=c("source", "dtime"), 
                           critFilter=list(locNamefct=multiYearLocales),
                           seed=2008041345,
                           nrounds=params["nrounds"],
                           eta=params["eta"], 
                           max_depth=params["max_depth"],
                           print_every_n=50, 
                           funcRun=xgboost::xgb.cv, 
                           nfold=params["nfold"]
                           )
    
    xgbSmall[[rowNum]] <- list(params=params, results=xgbTemp)
    
}
## [1]  train-rmse:60.291858+0.029840   test-rmse:60.291836+0.122552 
## [51] train-rmse:6.952588+0.022743    test-rmse:6.967671+0.028935 
## [101]    train-rmse:5.911296+0.030230    test-rmse:5.934121+0.031484 
## [151]    train-rmse:5.440318+0.028012    test-rmse:5.468477+0.030610 
## [201]    train-rmse:5.143110+0.026557    test-rmse:5.177588+0.031563 
## [251]    train-rmse:4.938329+0.027705    test-rmse:4.977193+0.034568 
## [301]    train-rmse:4.777454+0.021411    test-rmse:4.820399+0.026625 
## [351]    train-rmse:4.632995+0.018280    test-rmse:4.677907+0.027320 
## [401]    train-rmse:4.504584+0.022960    test-rmse:4.552695+0.030454 
## [451]    train-rmse:4.400923+0.020546    test-rmse:4.452110+0.027658 
## [500]    train-rmse:4.318087+0.017160    test-rmse:4.371721+0.020521 
## [1]  train-rmse:47.414653+0.025543   test-rmse:47.416267+0.110105 
## [51] train-rmse:5.505623+0.039939    test-rmse:5.542213+0.055560 
## [101]    train-rmse:4.837401+0.018193    test-rmse:4.887497+0.028972 
## [151]    train-rmse:4.441178+0.027482    test-rmse:4.499732+0.035334 
## [201]    train-rmse:4.214036+0.027466    test-rmse:4.279759+0.029091 
## [251]    train-rmse:4.061037+0.018717    test-rmse:4.133555+0.019708 
## [301]    train-rmse:3.951799+0.025049    test-rmse:4.028980+0.023754 
## [351]    train-rmse:3.865718+0.023845    test-rmse:3.949121+0.023304 
## [401]    train-rmse:3.794199+0.018816    test-rmse:3.882646+0.017903 
## [451]    train-rmse:3.730279+0.018420    test-rmse:3.823405+0.016648 
## [500]    train-rmse:3.681347+0.016766    test-rmse:3.779265+0.015579 
## [1]  train-rmse:28.598478+0.022489   test-rmse:28.602791+0.088590 
## [51] train-rmse:5.107773+0.055545    test-rmse:5.154449+0.062315 
## [101]    train-rmse:4.415871+0.025289    test-rmse:4.478973+0.021201 
## [151]    train-rmse:4.132172+0.022054    test-rmse:4.211309+0.030999 
## [201]    train-rmse:3.941383+0.022689    test-rmse:4.033375+0.025258 
## [251]    train-rmse:3.801489+0.009797    test-rmse:3.904041+0.018229 
## [301]    train-rmse:3.690903+0.008401    test-rmse:3.806468+0.015299 
## [351]    train-rmse:3.600958+0.010827    test-rmse:3.725559+0.017532 
## [401]    train-rmse:3.528479+0.008935    test-rmse:3.662803+0.017003 
## [451]    train-rmse:3.467037+0.011508    test-rmse:3.609574+0.018976 
## [500]    train-rmse:3.411157+0.012271    test-rmse:3.560714+0.014885 
## [1]  train-rmse:60.209776+0.029499   test-rmse:60.210296+0.117814 
## [51] train-rmse:5.102061+0.012033    test-rmse:5.171496+0.017799 
## [101]    train-rmse:4.363518+0.019358    test-rmse:4.479946+0.015077 
## [151]    train-rmse:4.008150+0.011056    test-rmse:4.166304+0.019110 
## [201]    train-rmse:3.788664+0.005680    test-rmse:3.981695+0.017881 
## [251]    train-rmse:3.622718+0.006958    test-rmse:3.847228+0.020058 
## [301]    train-rmse:3.491297+0.008235    test-rmse:3.742838+0.018430 
## [351]    train-rmse:3.382102+0.008616    test-rmse:3.656639+0.015591 
## [401]    train-rmse:3.288474+0.008511    test-rmse:3.584193+0.017333 
## [451]    train-rmse:3.210271+0.009739    test-rmse:3.525187+0.016061 
## [500]    train-rmse:3.141030+0.008145    test-rmse:3.474401+0.017219 
## [1]  train-rmse:47.134219+0.024267   test-rmse:47.136035+0.096315 
## [51] train-rmse:4.045615+0.025654    test-rmse:4.213756+0.033934 
## [101]    train-rmse:3.535829+0.009401    test-rmse:3.796318+0.014716 
## [151]    train-rmse:3.259137+0.008667    test-rmse:3.586416+0.015300 
## [201]    train-rmse:3.068447+0.007301    test-rmse:3.453034+0.014511 
## [251]    train-rmse:2.920823+0.009054    test-rmse:3.359908+0.012705 
## [301]    train-rmse:2.804430+0.009715    test-rmse:3.295715+0.010669 
## [351]    train-rmse:2.708823+0.006058    test-rmse:3.248712+0.010730 
## [401]    train-rmse:2.624162+0.008032    test-rmse:3.212151+0.012557 
## [451]    train-rmse:2.551436+0.007545    test-rmse:3.184392+0.013666 
## [500]    train-rmse:2.488435+0.005479    test-rmse:3.160857+0.013264 
## [1]  train-rmse:27.817778+0.017617   test-rmse:27.823969+0.065229 
## [51] train-rmse:3.703173+0.029105    test-rmse:3.974862+0.032755 
## [101]    train-rmse:3.196706+0.018067    test-rmse:3.613539+0.022093 
## [151]    train-rmse:2.931391+0.021877    test-rmse:3.464887+0.027318 
## [201]    train-rmse:2.751677+0.016366    test-rmse:3.390184+0.023319 
## [251]    train-rmse:2.616835+0.013893    test-rmse:3.351378+0.021288 
## [301]    train-rmse:2.501442+0.009328    test-rmse:3.325782+0.019926 
## [351]    train-rmse:2.405215+0.005827    test-rmse:3.308294+0.019058 
## [401]    train-rmse:2.324063+0.006549    test-rmse:3.296866+0.018918 
## [451]    train-rmse:2.248729+0.008088    test-rmse:3.289588+0.016957 
## [500]    train-rmse:2.177265+0.006178    test-rmse:3.285350+0.016165 
## [1]  train-rmse:60.177299+0.028919   test-rmse:60.177698+0.114226 
## [51] train-rmse:3.881933+0.016210    test-rmse:4.260229+0.017132 
## [101]    train-rmse:3.137727+0.015027    test-rmse:3.773720+0.026478 
## [151]    train-rmse:2.710896+0.008586    test-rmse:3.534585+0.022632 
## [201]    train-rmse:2.448061+0.002653    test-rmse:3.407226+0.018462 
## [251]    train-rmse:2.251032+0.006251    test-rmse:3.322486+0.015428 
## [301]    train-rmse:2.097336+0.008329    test-rmse:3.267107+0.015011 
## [351]    train-rmse:1.969215+0.011110    test-rmse:3.228843+0.014409 
## [401]    train-rmse:1.862160+0.016369    test-rmse:3.200573+0.013999 
## [451]    train-rmse:1.764049+0.012783    test-rmse:3.179215+0.012618 
## [500]    train-rmse:1.677405+0.012126    test-rmse:3.163679+0.010797 
## [1]  train-rmse:47.019374+0.022382   test-rmse:47.021793+0.085563 
## [51] train-rmse:2.801653+0.022126    test-rmse:3.675591+0.026313 
## [101]    train-rmse:2.159013+0.010218    test-rmse:3.419520+0.012152 
## [151]    train-rmse:1.800193+0.012705    test-rmse:3.341851+0.011634 
## [201]    train-rmse:1.550361+0.009200    test-rmse:3.313317+0.010477 
## [251]    train-rmse:1.359007+0.013033    test-rmse:3.300204+0.010413 
## [301]    train-rmse:1.199933+0.009052    test-rmse:3.295912+0.009275 
## [351]    train-rmse:1.071990+0.011760    test-rmse:3.296180+0.007863 
## [401]    train-rmse:0.963536+0.007773    test-rmse:3.296334+0.007524 
## [451]    train-rmse:0.865972+0.006748    test-rmse:3.297515+0.008136 
## [500]    train-rmse:0.780306+0.006435    test-rmse:3.299618+0.007300 
## [1]  train-rmse:27.480704+0.012962   test-rmse:27.490717+0.044583 
## [51] train-rmse:2.346123+0.023517    test-rmse:3.765403+0.026814 
## [101]    train-rmse:1.710224+0.021395    test-rmse:3.707509+0.029322 
## [151]    train-rmse:1.339575+0.008671    test-rmse:3.711507+0.028175 
## [201]    train-rmse:1.058404+0.007993    test-rmse:3.724789+0.028353 
## [251]    train-rmse:0.846464+0.012995    test-rmse:3.737257+0.026698 
## [301]    train-rmse:0.675616+0.010724    test-rmse:3.745457+0.024722 
## [351]    train-rmse:0.552008+0.013427    test-rmse:3.751203+0.023449 
## [401]    train-rmse:0.446647+0.009868    test-rmse:3.757080+0.022794 
## [451]    train-rmse:0.364546+0.009564    test-rmse:3.761381+0.022925 
## [500]    train-rmse:0.298650+0.006166    test-rmse:3.764331+0.022158

The test and train RMSE by iteration can be extracted and plotted by combination of parameters:

# Extract the key parameters as a character vector
allParams <- sapply(xgbSmall, FUN=function(x) x[["params"]])
keyParamVec <- allParams %>% 
    apply(2, FUN=function(x) paste0(names(x)[1], ": ", x[1], ", ", names(x)[2], ": ", x[2]))

# Extract RMSE and attach character vector
allRMSE <- map_dfr(xgbSmall, 
                   .f=function(x) x[["results"]][["xgbModel"]]$evaluation_log, 
                   .id="source"
                   ) %>%
    mutate(desc=factor(keyParamVec[as.numeric(source)], levels=keyParamVec)) %>%
    tibble::as_tibble()

# Get the minimum test RMSE
minTestRMSE <- allRMSE %>% 
    select(test_rmse_mean) %>% 
    min()

# Create plot for final test RMSE
allRMSE %>%
    select(desc, iter, train=train_rmse_mean, test=test_rmse_mean) %>%
    filter(iter==max(iter)) %>%
    ggplot(aes(x=desc, y=test)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(y=test/2, label=round(test, 1))) + 
    labs(x="", y="Test RMSE after 500 iterations", title="Test RMSE at 500 iterations by hyperparameters") +
    coord_flip()

# Create plot for RMSE evolution
allRMSE %>%
    select(desc, iter, train=train_rmse_mean, test=test_rmse_mean) %>%
    pivot_longer(-c(desc, iter), names_to="type", values_to="RMSE") %>%
    filter(RMSE <= 8) %>%
    ggplot(aes(x=iter, y=RMSE, group=type, color=type)) + 
    geom_line() + 
    labs(x="# Iterations", title="RMSE Evolution by Hyperparameters", subtitle="Only RMSE <= 8 plotted") +
    geom_hline(aes(yintercept=minTestRMSE), color="red", lty=2) +
    geom_vline(aes(xintercept=10), lty=2) +
    facet_wrap(~desc)

As expected, increasing eta and max_depth tends to induce over-fitting while also driving toward the optimal RMSE more quickly: a trade-off. The default eta=0.3 and max_depth=6 appears to be close to the minimum RMSE at 500 iterations, with a rather modest gap between test and train RMSE. The most over-fit model (eta=0.6, max_depth=10) appears to have converged at a high test RMSE. The slowest model (eta=0.1, max_depth=3) appears to still have significant room to learn even after 500 iterations.

The top performing models at 500 iterations appear to be blends of parameters that average out to “moderate-high” learning ability (low eta=0.1 with high max_depth=10, medium eta=0.3 with medium max_depth=6, medium eta=0.3 with high max_depth=10, high eta=0.6 with medium max_depth=6).
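The winning combination can also be pulled out programmatically rather than read off the plot. This is a minimal sketch using a mock tibble in place of allRMSE (same columns: desc, iter, test_rmse_mean); with the real object, the identical slice_min() call would identify the best hyperparameter row:

```r
library(dplyr)
library(tibble)

# Mock evaluation log standing in for allRMSE (desc, iter, test_rmse_mean)
mockRMSE <- tibble(desc=rep(c("eta: 0.3, max_depth: 6", "eta: 0.6, max_depth: 10"), each=3), 
                   iter=rep(c(100, 300, 500), 2), 
                   test_rmse_mean=c(3.8, 3.4, 3.2, 3.7, 3.3, 3.3)
                   )

# Single row with the minimum mean test RMSE across all configs/iterations
bestRow <- mockRMSE %>% 
    slice_min(test_rmse_mean, n=1, with_ties=FALSE)
bestRow
```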

Suppose the slowest model is run for 2500 iterations to check on convergence:

# Define key predictor variables for base XGB runs
baseXGBPreds <- c("locNamefct", "month", "hrfct", 
                  "DewF", "modSLP", "Altimeter", "WindSpeed", 
                  "predomDir", "minHeight", "ceilingHeight"
                  )

# Core multi-year cities
multiYearLocales <- c("Las Vegas, NV", "New Orleans, LA", "Chicago, IL", "San Diego, CA")

# Run the function shell
xgbSlow <- xgbRunModel(filter(metarData, !is.na(TempF)), 
                       depVar="TempF", 
                       predVars=baseXGBPreds, 
                       otherVars=c("source", "dtime"), 
                       critFilter=list(locNamefct=multiYearLocales),
                       seed=2008041432,
                       nrounds=2500,
                       print_every_n=50, 
                       eta=0.1, 
                       max_depth=3
                       )
## [1]  train-rmse:60.254131 
## [51] train-rmse:6.966332 
## [101]    train-rmse:5.919064 
## [151]    train-rmse:5.452974 
## [201]    train-rmse:5.163146 
## [251]    train-rmse:4.979972 
## [301]    train-rmse:4.815380 
## [351]    train-rmse:4.661036 
## [401]    train-rmse:4.540689 
## [451]    train-rmse:4.424547 
## [501]    train-rmse:4.343824 
## [551]    train-rmse:4.269697 
## [601]    train-rmse:4.204936 
## [651]    train-rmse:4.158858 
## [701]    train-rmse:4.111222 
## [751]    train-rmse:4.061322 
## [801]    train-rmse:4.019222 
## [851]    train-rmse:3.981936 
## [901]    train-rmse:3.936117 
## [951]    train-rmse:3.910823 
## [1001]   train-rmse:3.884617 
## [1051]   train-rmse:3.862150 
## [1101]   train-rmse:3.838576 
## [1151]   train-rmse:3.815784 
## [1201]   train-rmse:3.794541 
## [1251]   train-rmse:3.777628 
## [1301]   train-rmse:3.758889 
## [1351]   train-rmse:3.743135 
## [1401]   train-rmse:3.727739 
## [1451]   train-rmse:3.709213 
## [1501]   train-rmse:3.688144 
## [1551]   train-rmse:3.676046 
## [1601]   train-rmse:3.661273 
## [1651]   train-rmse:3.644699 
## [1701]   train-rmse:3.625214 
## [1751]   train-rmse:3.612606 
## [1801]   train-rmse:3.600801 
## [1851]   train-rmse:3.581740 
## [1901]   train-rmse:3.570685 
## [1951]   train-rmse:3.559159 
## [2001]   train-rmse:3.544988 
## [2051]   train-rmse:3.535298 
## [2101]   train-rmse:3.526862 
## [2151]   train-rmse:3.515605 
## [2201]   train-rmse:3.503624 
## [2251]   train-rmse:3.493750 
## [2301]   train-rmse:3.483091 
## [2351]   train-rmse:3.475146 
## [2401]   train-rmse:3.467263 
## [2451]   train-rmse:3.458688 
## [2500]   train-rmse:3.450731

The evaluation functions can then be run:

# Plot the evolution
xgbSlowEvolution <- plotXGBTrainEvolution(xgbSlow, fullSD="TempF", plot_every=100)

# Assess performance on test data
plotXGBTestData(xgbSlow, depVar="TempF", reportBy=c("locNamefct", "month"))
## 
## OVERALL PREDICTIVE PERFORMANCE:
## 
## # A tibble: 1 x 3
##   rmse_orig rmse_xgb   rsq
##       <dbl>    <dbl> <dbl>
## 1      18.0     3.54 0.961
## 
## 
## 
## PREDICTIVE PERFORMANCE BY GROUP(S):
## 
## # A tibble: 4 x 4
##   locNamefct      rmse_orig rmse_xgb   rsq
##   <fct>               <dbl>    <dbl> <dbl>
## 1 Chicago, IL         21.1      4.01 0.964
## 2 Las Vegas, NV       18.2      3.29 0.967
## 3 New Orleans, LA     13.4      3.47 0.933
## 4 San Diego, CA        7.35     3.32 0.796
## 
## # A tibble: 12 x 4
##    month rmse_orig rmse_xgb   rsq
##    <fct>     <dbl>    <dbl> <dbl>
##  1 Jan        17.2     3.42 0.960
##  2 Feb        17.7     3.73 0.956
##  3 Mar        14.8     3.70 0.937
##  4 Apr        12.5     4.00 0.898
##  5 May        11.4     3.78 0.890
##  6 Jun        12.0     3.33 0.923
##  7 Jul        11.1     3.06 0.924
##  8 Aug        10.3     2.88 0.922
##  9 Sep        10.2     3.38 0.890
## 10 Oct        11.6     3.72 0.896
## 11 Nov        14.0     3.86 0.924
## 12 Dec        14.2     3.42 0.942
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Even at 2500 iterations, the model shows almost no over-fitting (train RMSE 3.45, test RMSE 3.54). However, the model appears to be converging at a test RMSE several tenths of a degree higher than some of the other models achieve in as little as 500 iterations.

XGB classification of locales for 2016

The XGB approach can also be followed for classifications, by passing a factor variable for y and updating the objective function and metric to be appropriate for classification. The general recipe is similar to what is followed for regression.
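The switch described above can be sketched in a few lines; this is a minimal, standalone illustration with toy data (the matrix `X` and labels `y` are illustrative, not from the METAR data):

```r
# Minimal sketch of the regression-to-classification switch in xgboost
library(xgboost)

set.seed(42)
X <- matrix(rnorm(200 * 3), ncol=3)
y <- as.integer(X[, 1] + rnorm(200) > 0)  # 0/1 labels replace the numeric target

bst <- xgboost(data=X, label=y,
               nrounds=50,
               objective="binary:logistic",  # replaces the squared-error regression objective
               eval_metric="error",          # classification error rate replaces RMSE
               verbose=0)

# Predictions come back as probabilities in [0, 1]
head(predict(bst, X))
```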

To begin, a simple classification will be run using 2016 data to assess “is this data from Las Vegas?”. Las Vegas is selected because its climate is largely distinct from the other locales. To further simplify this first pass, Phoenix data will be excluded, and the “All Other” group will be under-sampled to match the data volume of Las Vegas:

set.seed(2008051312)

# Extract a balanced sample of Las Vegas 2016 data and all-other 2016 data (excluding Phoenix)
las2016Data <- metarData %>%
    filter(year==2016, !is.na(TempF), !(locNamefct %in% c("Phoenix, AZ"))) %>%
    mutate(isLAS=factor(ifelse(locNamefct=="Las Vegas, NV", "Las Vegas", "All Other"), 
                        levels=c("Las Vegas", "All Other")
                        ), 
           nLAS=sum(isLAS=="Las Vegas")
           ) %>%
    group_by(isLAS) %>%
    sample_n(min(nLAS)) %>% # not ideal coding, but it works
    ungroup()
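As an aside, the `sample_n(min(nLAS))` idiom flagged above as “not ideal” can be made more explicit; a sketch using dplyr’s `slice_sample()` (the helper name `balanceGroups` is illustrative):

```r
library(dplyr)

# Under-sample every level of 'grpVar' down to the size of the smallest group
balanceGroups <- function(df, grpVar) {
    nMin <- df %>% count(.data[[grpVar]]) %>% pull(n) %>% min()
    df %>%
        group_by(.data[[grpVar]]) %>%
        slice_sample(n=nMin) %>%
        ungroup()
}

# e.g. las2016Data <- balanceGroups(las2016Data, "isLAS")
```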

First, the random forest approach is run on the Las Vegas 2016 data:

# Define key predictor variables for base XGB runs
locXGBPreds <- c("month", "hrfct", 
                 "TempF", "DewF", "modSLP", "Altimeter", "WindSpeed", 
                 "predomDir", "minHeight", "ceilingHeight"
                 )

# Run random forest for Las Vegas 2016 classifications
rf_las2016 <- rfMultiLocale(las2016Data, 
                            vrbls=locXGBPreds,
                            otherVar=keepVarFull,
                            locs=NULL, 
                            locVar="isLAS",
                            pred="isLAS",
                            ntree=100, 
                            seed=2008051316, 
                            mtry=4
                            )
## 
## Running for locations:
## [1] "Las Vegas" "All Other"

The variable importances and quality of the predictions can then be assessed:

# Plot variable importances
helperPlotVarImp(rf_las2016$rfModel)

# Evaluate prediction accuracy
evalPredictions(rf_las2016, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP, Altimeter", 
                keyVar="isLAS"
                )

## # A tibble: 4 x 5
##   locale    predicted correct     n    pct
##   <fct>     <fct>     <lgl>   <int>  <dbl>
## 1 Las Vegas Las Vegas TRUE     2496 0.952 
## 2 Las Vegas All Other FALSE     125 0.0477
## 3 All Other Las Vegas FALSE     108 0.0412
## 4 All Other All Other TRUE     2512 0.959

Using majority-rules classification, the model achieves balanced 96% accuracy in classifying locales as being Las Vegas or All Other. The main variables that assist in the classification are dew point, minimum cloud height, and temperature. This seems plausible as Las Vegas is rarely cloudy (and almost never at low levels) and shows a significantly different density on a dewpoint/temperature plot than most other locales:

plotOrder <- rev(levels(rf_las2016$testData$minHeight))

p1_las2016 <- rf_las2016$testData %>%
    ggplot(aes(x=isLAS, fill=factor(minHeight, levels=plotOrder))) + 
    geom_bar(position="fill") + 
    coord_flip() + 
    labs(x="", y="Percentage of observations", title="Minimum cloud heights") + 
    scale_fill_discrete("Min Cloud Height", guide=guide_legend(reverse=TRUE, nrow=2, byrow=TRUE), 
                        labels=plotOrder, breaks=plotOrder
                        ) + 
    theme(legend.position="bottom")

p2_las2016 <- rf_las2016$testData %>%
    ggplot(aes(x=TempF, y=DewF)) + 
    geom_point(alpha=0.1) + 
    labs(x="Temperature (F)", y="Dew Point (F)", title="Temp/Dew Point Distribution") + 
    facet_wrap(~isLAS)

gridExtra::grid.arrange(p1_las2016, p2_las2016, nrow=1)

The model casts 80%+ of its votes in one direction or the other about 80% of the time. These classifications are ~99% accurate. When the model is less confident, prediction quality declines to the 70%-80% range:

# Evaluate prediction certainty
probs_las2016 <- 
    assessPredictionCertainty(rf_las2016, 
                              keyVar="isLAS", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP, Altimeter", 
                              showAcc=TRUE
                              )

The process is then repeated using XGB, with the following modifications:

  1. Convert the isLAS variable to 0/1
  2. Update the objective function to be logistic

First, the data are reshaped and the CV process is run to see where the test error stabilizes:

las2016Data <- las2016Data %>%
    mutate(binLAS=ifelse(isLAS=="Las Vegas", 1, 0))
# Run the function shell
xgb_las2016_cv <- xgbRunModel(las2016Data, 
                              depVar="binLAS", 
                              predVars=locXGBPreds, 
                              otherVars=keepVarFull, 
                              seed=2008051405,
                              nrounds=1000,
                              print_every_n=50, 
                              xgbObjective="binary:logistic", 
                              funcRun=xgboost::xgb.cv, 
                              nfold=5
                              )
## [1]  train-error:0.089480+0.002471   test-error:0.100826+0.005528 
## [51] train-error:0.011428+0.001170   test-error:0.038433+0.002412 
## [101]    train-error:0.002208+0.000704   test-error:0.031809+0.002303 
## [151]    train-error:0.000470+0.000153   test-error:0.030664+0.002807 
## [201]    train-error:0.000020+0.000041   test-error:0.029602+0.001874 
## [251]    train-error:0.000000+0.000000   test-error:0.028784+0.002404 
## [301]    train-error:0.000000+0.000000   test-error:0.028866+0.002539 
## [351]    train-error:0.000000+0.000000   test-error:0.028293+0.001870 
## [401]    train-error:0.000000+0.000000   test-error:0.027721+0.002242 
## [451]    train-error:0.000000+0.000000   test-error:0.027721+0.002799 
## [501]    train-error:0.000000+0.000000   test-error:0.027721+0.002301 
## [551]    train-error:0.000000+0.000000   test-error:0.027476+0.002811 
## [601]    train-error:0.000000+0.000000   test-error:0.027312+0.002950 
## [651]    train-error:0.000000+0.000000   test-error:0.027230+0.003128 
## [701]    train-error:0.000000+0.000000   test-error:0.027557+0.002997 
## [751]    train-error:0.000000+0.000000   test-error:0.027148+0.003020 
## [801]    train-error:0.000000+0.000000   test-error:0.027721+0.002627 
## [851]    train-error:0.000000+0.000000   test-error:0.027557+0.002513 
## [901]    train-error:0.000000+0.000000   test-error:0.027230+0.002766 
## [951]    train-error:0.000000+0.000000   test-error:0.027230+0.002604 
## [1000]   train-error:0.000000+0.000000   test-error:0.027148+0.002500

Error rate evolution can be plotted:

# Calculate minimum test error
minTestError <- min(xgb_las2016_cv$xgbModel$evaluation_log$test_error_mean)

# Plot the error-rate evolution
xgb_las2016_cv$xgbModel$evaluation_log %>%
    select(-contains("std")) %>%
    select(iter, train=train_error_mean, test=test_error_mean) %>%
    pivot_longer(-c(iter), names_to="type", values_to="error") %>%
    ggplot(aes(x=iter, y=error, group=type, color=type)) + 
    geom_line() + 
    geom_hline(aes(yintercept=minTestError), color="red", lty=2) +
    labs(x="# Iterations", title="Error Rate Evolution")

Test error appears to be minimized in the 250-500 iterations range, while train error has been driven to zero within the first few hundred iterations. A full XGB process is run using 500 iterations:
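Rather than eyeballing the CV curve, `xgb.cv`’s `early_stopping_rounds` argument can select the round count automatically. A sketch with illustrative toy data (in this document it would use the prepared las2016 inputs instead):

```r
library(xgboost)

set.seed(42)
X <- matrix(rnorm(500 * 4), ncol=4)
y <- as.integer(X[, 1] > 0)

cv <- xgb.cv(data=X, label=y,
             nrounds=1000, nfold=5,
             objective="binary:logistic",
             early_stopping_rounds=50,  # stop once test error fails to improve for 50 rounds
             verbose=0)

cv$best_iteration  # suggested nrounds for the final model
```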

# Run the function shell
xgb_las2016 <- xgbRunModel(las2016Data, 
                           depVar="binLAS", 
                           predVars=locXGBPreds, 
                           otherVars=keepVarFull, 
                           seed=2008051405,
                           nrounds=500,
                           print_every_n=25, 
                           xgbObjective="binary:logistic"
                           )
## [1]  train-error:0.096001 
## [26] train-error:0.025513 
## [51] train-error:0.012429 
## [76] train-error:0.006215 
## [101]    train-error:0.002535 
## [126]    train-error:0.001554 
## [151]    train-error:0.000409 
## [176]    train-error:0.000082 
## [201]    train-error:0.000000 
## [226]    train-error:0.000000 
## [251]    train-error:0.000000 
## [276]    train-error:0.000000 
## [301]    train-error:0.000000 
## [326]    train-error:0.000000 
## [351]    train-error:0.000000 
## [376]    train-error:0.000000 
## [401]    train-error:0.000000 
## [426]    train-error:0.000000 
## [451]    train-error:0.000000 
## [476]    train-error:0.000000 
## [500]    train-error:0.000000

Accuracy of the classifications can then be assessed by investigating the “predicted” column of the testData frame:

# Overall classification probability histogram
xgb_las2016$testData %>%
    ggplot(aes(x=predicted)) + 
    geom_histogram(fill="lightblue") + 
    labs(x="Predicted Probability of Las Vegas", y="", 
         title="Prediction Probabilities - Las Vegas vs. Other"
         )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Overall accuracy by bin
xgb_las2016$testData %>%
    mutate(binPred=ifelse(predicted>=0.5, 1, 0)) %>%
    group_by(binLAS) %>%
    summarize(accuracy=mean(binLAS==binPred))
## # A tibble: 2 x 2
##   binLAS accuracy
##    <dbl>    <dbl>
## 1      0    0.964
## 2      1    0.986
# Accuracy by predicted probability (whether certainty is at least 80%)
xgb_las2016$testData %>%
    mutate(binPred=ifelse(predicted>=0.5, 1, 0), 
           confPred=ifelse(abs(predicted-0.5)>=0.3, "High", "Low")
           ) %>%
    group_by(confPred, binLAS) %>%
    summarize(accuracy=mean(binLAS==binPred), n=n())
## # A tibble: 4 x 4
## # Groups:   confPred [2]
##   confPred binLAS accuracy     n
##   <chr>     <dbl>    <dbl> <int>
## 1 High          0    0.975  2608
## 2 High          1    0.994  2485
## 3 Low           0    0.558    77
## 4 Low           1    0.690    71

Overall accuracy is in the 97% range, roughly 1% higher than is achieved using the random forest algorithm. The XGB model is slightly more likely to make a high-confidence error in classifying a non-Las Vegas locale as Las Vegas, and much less likely to have a low confidence in its predictions.

Variable importance can also be assessed:

# Find and plot importances
xgb_las2016_importance <- plotXGBImportance(xgb_las2016, 
                                            featureStems=locXGBPreds, 
                                            stemMapper = varMapper, 
                                            plotTitle="Gain by variable in xgboost", 
                                            plotSubtitle="Modeling 2016 Locale (Las Vegas vs. Other)"
                                            )
## # A tibble: 10 x 4
##    Feature          Gain  Cover Frequency
##    <chr>           <dbl>  <dbl>     <dbl>
##  1 month         0.0550  0.115     0.0853
##  2 hrfct         0.0122  0.128     0.0596
##  3 TempF         0.264   0.148     0.156 
##  4 DewF          0.361   0.166     0.160 
##  5 modSLP        0.109   0.170     0.196 
##  6 Altimeter     0.0552  0.104     0.141 
##  7 WindSpeed     0.0638  0.0572    0.104 
##  8 predomDir     0.0219  0.0483    0.0415
##  9 minHeight     0.0507  0.0367    0.0374
## 10 ceilingHeight 0.00682 0.0271    0.0193

At a glance, the XGB algorithm makes high use of dew point and temperature, consistent with the random forest algorithm. In contrast, the XGB algorithm prefers to use sea-level pressure next while the random forest model prefers to use minimum cloud height next.
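For reference, the Gain/Cover/Frequency columns above are xgboost’s standard importance measures, and they can be pulled directly from any fitted booster. A standalone sketch with illustrative data and feature names:

```r
library(xgboost)

set.seed(42)
X <- matrix(rnorm(300 * 3), ncol=3, dimnames=list(NULL, c("a", "b", "c")))
y <- as.integer(X[, 1] > 0)
bst <- xgboost(data=X, label=y, nrounds=20, objective="binary:logistic", verbose=0)

imp <- xgb.importance(model=bst)  # columns: Feature, Gain, Cover, Frequency
xgb.plot.importance(imp)          # bar chart, largest Gain first
```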

Lastly, predictions are plotted against actual values:

xgb_las2016$testData %>%
    mutate(rndPred=round(2*predicted, 1)/2) %>%
    group_by(rndPred) %>%
    summarize(n=n(), meanPred=mean(binLAS)) %>%
    ggplot(aes(x=rndPred, y=meanPred)) + 
    geom_point(aes(size=n)) + 
    geom_abline(lty=2, color="red") + 
    geom_text(aes(y=meanPred+0.1, label=paste0((round(meanPred, 2)), "\n(n=", n, ")")), size=3) + 
    labs(x="Predicted Probability Las Vegas", y="Actual Proportion Las Vegas")

Broadly, there is a strong association between the predicted probabilities and the actual proportions. As noted previously, almost all of the predictions are correctly made with high confidence.

Next steps are to explore some trickier single-locale classifications and then to explore multi-locale classifications.

Suppose that every locale is run through the process of self vs. other, with other under-sampled to be the same size as self, using 500 rounds:

# Function to run one vs all using XGB
helperXGBOnevAll <- function(df, 
                             keyLoc, 
                             critFilter=vector("list", 0),
                             underSample=TRUE,
                             predVars=locXGBPreds, 
                             otherVars=keepVarFull, 
                             seed=NULL, 
                             nrounds=500, 
                             print_every_n=100, 
                             xgbObjective="binary:logistic", 
                             ...
                             ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble
    # keyLoc: the value of locNamefct to use as the 'one' for 'one' vs. all-other
    # critFilter: named list, of format name=values, where filtering will be (get(name) %in% values)
    # underSample: boolean, if TRUE take 'all other' and randomly under-sample to be the same size as keyLoc
    # predVars: explanatory variables for modeling
    # otherVars: other variables to be kept, but not used in modeling
    # seed: the random seed (NULL means no seed)
    # nrounds: the number of XGB training rounds
    # print_every_n: the frequency of printing the XGB training error
    # xgbObjective: the objective function to be used in XGB modeling
    # ...: any other arguments to be passed (eventually) to xgboost::xgboost
    
    # Set the seed if it has been passed (drives consistency of under-sampling)
    if (!is.null(seed)) set.seed(seed)
    
    # Generate the descriptive name for keyLoc
    descName <- str_replace(keyLoc, pattern=", \\w{2}$", replacement="")
    
    # Announce the locale being run
    cat("\n\n **************")
    cat("\nRunning for", keyLoc, "with description", descName, "\n")
    
    # Filter for non-NA data across all of 'locNamefct', predVars, otherVars, names(critFilter) are included
    subDF <- df %>%
        filter_at(vars(all_of(c("locNamefct", predVars, otherVars, names(critFilter)))), 
                  all_vars(!is.na(.))
                  )
    
    # Filter such that only matches to critFilter are included
    for (xNum in seq_along(critFilter)) {
        subDF <- subDF %>%
            filter_at(vars(all_of(names(critFilter)[xNum])), ~. %in% critFilter[[xNum]])
    }

    # Create the variables needed for modeling
    subDF <- subDF %>%
        mutate(isKEY=factor(ifelse(locNamefct==keyLoc, descName, "All Other"), 
                            levels=c(descName, "All Other")
                            ), 
               nKEY=sum(isKEY==descName), 
               binKEY=ifelse(isKEY==descName, 1, 0)
               )
    
    # Extract a balanced sample of data matching keyLoc and all-other data, if underSample==TRUE
    if (isTRUE(underSample)) {
        subDF <- subDF %>%
            group_by(isKEY) %>%
            sample_n(min(nKEY)) %>% # not ideal coding, but it works
            ungroup()
    }

    # Run the function shell
    outData <- xgbRunModel(subDF, 
                           depVar="binKEY", 
                           predVars=predVars, 
                           otherVars=otherVars, 
                           seed=seed,
                           nrounds=nrounds,
                           print_every_n=print_every_n, 
                           xgbObjective=xgbObjective, 
                           ...
                           )
    
    # Calculate overall accuracy
    accOverall <- outData$testData %>%
        mutate(correct=round(predicted)==binKEY) %>%
        pull(correct) %>%
        mean()
    
    # Report on finishing
    cat("Finished processing", keyLoc, "with overall test set accuracy", round(accOverall, 3), "\n")
    
    # Return the list
    outData
    
}

The function is then run for Las Vegas, including all 2016 cities:

xgb_las2016_check <- helperXGBOnevAll(metarData, 
                                      keyLoc="Las Vegas, NV", 
                                      critFilter=list(year=2016), 
                                      seed=2008061322
                                      )
## 
## 
##  **************
## Running for Las Vegas, NV with description Las Vegas 
## [1]  train-error:0.104506 
## [101]    train-error:0.007196 
## [201]    train-error:0.000409 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Las Vegas, NV with overall test set accuracy 0.965
# Report accuracy without Phoenix
xgb_las2016_check$testData %>%
    filter(locNamefct != "Phoenix, AZ") %>%
    mutate(correct=round(predicted)==binKEY) %>%
    pull(correct) %>%
    mean() %>%
    round(3)
## [1] 0.976

Overall test set accuracy as reported is 96.5%. Excluding Phoenix, which often classifies as Las Vegas and was excluded in the previous run, accuracy is roughly 97%, as before.

Suppose that every locale is run, with the objective of finding cities that are most and least unique:

# Extract the locales that are available in 2016
locs2016 <- metarData %>%
    filter(year==2016) %>%
    pull(locNamefct) %>%
    as.character() %>%
    unique() %>%
    sort()

# Create a container to hold each of the results
xgbOnevAll <- vector("list", length=length(locs2016))
names(xgbOnevAll) <- locs2016

# Run through the XGB process
n <- 1
for (loc in locs2016) {

    # Run and save the XGB model
    xgbOnevAll[[n]] <- helperXGBOnevAll(metarData, 
                                        keyLoc=loc, 
                                        critFilter=list(year=2016), 
                                        seed=2008061340+n
                                        )
    
    # Index the counter
    n <- n + 1
    
}
## 
## 
##  **************
## Running for Atlanta, GA with description Atlanta 
## [1]  train-error:0.323565 
## [101]    train-error:0.043099 
## [201]    train-error:0.013142 
## [301]    train-error:0.003020 
## [401]    train-error:0.000327 
## [500]    train-error:0.000082 
## Finished processing Atlanta, GA with overall test set accuracy 0.902 
## 
## 
##  **************
## Running for Boston, MA with description Boston 
## [1]  train-error:0.312644 
## [101]    train-error:0.066854 
## [201]    train-error:0.020386 
## [301]    train-error:0.005447 
## [401]    train-error:0.001073 
## [500]    train-error:0.000083 
## Finished processing Boston, MA with overall test set accuracy 0.892 
## 
## 
##  **************
## Running for Chicago, IL with description Chicago 
## [1]  train-error:0.386542 
## [101]    train-error:0.097308 
## [201]    train-error:0.042577 
## [301]    train-error:0.014356 
## [401]    train-error:0.005220 
## [500]    train-error:0.001223 
## Finished processing Chicago, IL with overall test set accuracy 0.798 
## 
## 
##  **************
## Running for Dallas, TX with description Dallas 
## [1]  train-error:0.276697 
## [101]    train-error:0.035487 
## [201]    train-error:0.010466 
## [301]    train-error:0.003352 
## [401]    train-error:0.000572 
## [500]    train-error:0.000082 
## Finished processing Dallas, TX with overall test set accuracy 0.911 
## 
## 
##  **************
## Running for Denver, CO with description Denver 
## [1]  train-error:0.149882 
## [101]    train-error:0.004329 
## [201]    train-error:0.000000 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Denver, CO with overall test set accuracy 0.972 
## 
## 
##  **************
## Running for Detroit, MI with description Detroit 
## [1]  train-error:0.373577 
## [101]    train-error:0.093210 
## [201]    train-error:0.041363 
## [301]    train-error:0.017119 
## [401]    train-error:0.006225 
## [500]    train-error:0.001884 
## Finished processing Detroit, MI with overall test set accuracy 0.802 
## 
## 
##  **************
## Running for Grand Rapids, MI with description Grand Rapids 
## [1]  train-error:0.367789 
## [101]    train-error:0.090127 
## [201]    train-error:0.036994 
## [301]    train-error:0.013573 
## [401]    train-error:0.003890 
## [500]    train-error:0.001241 
## Finished processing Grand Rapids, MI with overall test set accuracy 0.795 
## 
## 
##  **************
## Running for Green Bay, WI with description Green Bay 
## [1]  train-error:0.299343 
## [101]    train-error:0.054807 
## [201]    train-error:0.019638 
## [301]    train-error:0.006409 
## [401]    train-error:0.001397 
## [500]    train-error:0.000493 
## Finished processing Green Bay, WI with overall test set accuracy 0.854 
## 
## 
##  **************
## Running for Houston, TX with description Houston 
## [1]  train-error:0.222458 
## [101]    train-error:0.045521 
## [201]    train-error:0.014384 
## [301]    train-error:0.003432 
## [401]    train-error:0.001144 
## [500]    train-error:0.000327 
## Finished processing Houston, TX with overall test set accuracy 0.913 
## 
## 
##  **************
## Running for Indianapolis, IN with description Indianapolis 
## [1]  train-error:0.375595 
## [101]    train-error:0.089470 
## [201]    train-error:0.030999 
## [301]    train-error:0.011481 
## [401]    train-error:0.003280 
## [500]    train-error:0.000656 
## Finished processing Indianapolis, IN with overall test set accuracy 0.809 
## 
## 
##  **************
## Running for Las Vegas, NV with description Las Vegas 
## [1]  train-error:0.098127 
## [101]    train-error:0.006215 
## [201]    train-error:0.000327 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Las Vegas, NV with overall test set accuracy 0.963 
## 
## 
##  **************
## Running for Lincoln, NE with description Lincoln 
## [1]  train-error:0.287360 
## [101]    train-error:0.060462 
## [201]    train-error:0.016423 
## [301]    train-error:0.004821 
## [401]    train-error:0.001307 
## [500]    train-error:0.000082 
## Finished processing Lincoln, NE with overall test set accuracy 0.869 
## 
## 
##  **************
## Running for Los Angeles, CA with description Los Angeles 
## [1]  train-error:0.171377 
## [101]    train-error:0.026385 
## [201]    train-error:0.007678 
## [301]    train-error:0.001960 
## [401]    train-error:0.000327 
## [500]    train-error:0.000082 
## Finished processing Los Angeles, CA with overall test set accuracy 0.934 
## 
## 
##  **************
## Running for Madison, WI with description Madison 
## [1]  train-error:0.365856 
## [101]    train-error:0.082205 
## [201]    train-error:0.032716 
## [301]    train-error:0.013369 
## [401]    train-error:0.004069 
## [500]    train-error:0.001661 
## Finished processing Madison, WI with overall test set accuracy 0.809 
## 
## 
##  **************
## Running for Miami, FL with description Miami 
## [1]  train-error:0.125469 
## [101]    train-error:0.014449 
## [201]    train-error:0.002449 
## [301]    train-error:0.000163 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Miami, FL with overall test set accuracy 0.96 
## 
## 
##  **************
## Running for Milwaukee, WI with description Milwaukee 
## [1]  train-error:0.361758 
## [101]    train-error:0.089766 
## [201]    train-error:0.044025 
## [301]    train-error:0.017643 
## [401]    train-error:0.006453 
## [500]    train-error:0.001552 
## Finished processing Milwaukee, WI with overall test set accuracy 0.798 
## 
## 
##  **************
## Running for Minneapolis, MN with description Minneapolis 
## [1]  train-error:0.354010 
## [101]    train-error:0.072302 
## [201]    train-error:0.027796 
## [301]    train-error:0.008559 
## [401]    train-error:0.001549 
## [500]    train-error:0.000245 
## Finished processing Minneapolis, MN with overall test set accuracy 0.842 
## 
## 
##  **************
## Running for New Orleans, LA with description New Orleans 
## [1]  train-error:0.202465 
## [101]    train-error:0.008813 
## [201]    train-error:0.002040 
## [301]    train-error:0.000245 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing New Orleans, LA with overall test set accuracy 0.967 
## 
## 
##  **************
## Running for Newark, NJ with description Newark 
## [1]  train-error:0.327080 
## [101]    train-error:0.072186 
## [201]    train-error:0.031403 
## [301]    train-error:0.011664 
## [401]    train-error:0.002855 
## [500]    train-error:0.000571 
## Finished processing Newark, NJ with overall test set accuracy 0.85 
## 
## 
##  **************
## Running for Philadelphia, PA with description Philadelphia 
## [1]  train-error:0.363911 
## [101]    train-error:0.072326 
## [201]    train-error:0.032208 
## [301]    train-error:0.012720 
## [401]    train-error:0.004077 
## [500]    train-error:0.001631 
## Finished processing Philadelphia, PA with overall test set accuracy 0.844 
## 
## 
##  **************
## Running for Phoenix, AZ with description Phoenix 
## [1]  train-error:0.109726 
## [101]    train-error:0.006069 
## [201]    train-error:0.000902 
## [301]    train-error:0.000164 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Phoenix, AZ with overall test set accuracy 0.965 
## 
## 
##  **************
## Running for Saint Louis, MO with description Saint Louis 
## [1]  train-error:0.379026 
## [101]    train-error:0.064009 
## [201]    train-error:0.020677 
## [301]    train-error:0.007332 
## [401]    train-error:0.002059 
## [500]    train-error:0.000165 
## Finished processing Saint Louis, MO with overall test set accuracy 0.864 
## 
## 
##  **************
## Running for San Antonio, TX with description San Antonio 
## [1]  train-error:0.225978 
## [101]    train-error:0.001551 
## [201]    train-error:0.000000 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing San Antonio, TX with overall test set accuracy 0.982 
## 
## 
##  **************
## Running for San Diego, CA with description San Diego 
## [1]  train-error:0.152268 
## [101]    train-error:0.014630 
## [201]    train-error:0.002289 
## [301]    train-error:0.000409 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing San Diego, CA with overall test set accuracy 0.949 
## 
## 
##  **************
## Running for San Francisco, CA with description San Francisco 
## [1]  train-error:0.159878 
## [101]    train-error:0.021888 
## [201]    train-error:0.005678 
## [301]    train-error:0.001234 
## [401]    train-error:0.000165 
## [500]    train-error:0.000000 
## Finished processing San Francisco, CA with overall test set accuracy 0.929 
## 
## 
##  **************
## Running for San Jose, CA with description San Jose 
## [1]  train-error:0.159951 
## [101]    train-error:0.035644 
## [201]    train-error:0.011501 
## [301]    train-error:0.003344 
## [401]    train-error:0.000489 
## [500]    train-error:0.000163 
## Finished processing San Jose, CA with overall test set accuracy 0.915 
## 
## 
##  **************
## Running for Seattle, WA with description Seattle 
## [1]  train-error:0.216713 
## [101]    train-error:0.004166 
## [201]    train-error:0.000163 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Seattle, WA with overall test set accuracy 0.975 
## 
## 
##  **************
## Running for Tampa Bay, FL with description Tampa Bay 
## [1]  train-error:0.184260 
## [101]    train-error:0.025553 
## [201]    train-error:0.007021 
## [301]    train-error:0.002123 
## [401]    train-error:0.000408 
## [500]    train-error:0.000000 
## Finished processing Tampa Bay, FL with overall test set accuracy 0.923 
## 
## 
##  **************
## Running for Traverse City, MI with description Traverse City 
## [1]  train-error:0.277315 
## [101]    train-error:0.061163 
## [201]    train-error:0.021966 
## [301]    train-error:0.007921 
## [401]    train-error:0.002613 
## [500]    train-error:0.000327 
## Finished processing Traverse City, MI with overall test set accuracy 0.846 
## 
## 
##  **************
## Running for Washington, DC with description Washington 
## [1]  train-error:0.335551 
## [101]    train-error:0.067258 
## [201]    train-error:0.026443 
## [301]    train-error:0.010512 
## [401]    train-error:0.002628 
## [500]    train-error:0.000575 
## Finished processing Washington, DC with overall test set accuracy 0.865

Overall accuracy by locale can then be assessed:

# Function to calculate and extract overall accuracy from the XGB list
helperXGBListAccuracyOverall <- function(lst) {
    
    lst[["testData"]] %>%
        select(binKEY, predicted) %>%
        mutate(correct=binKEY==round(predicted)) %>%
        pull(correct) %>%
        mean()
    
}

# Extract overall accuracy by locale and format as a tibble
accList <- sapply(xgbOnevAll, FUN=helperXGBListAccuracyOverall) %>% 
    as.data.frame() %>% 
    rename("accOverall"=".") %>% 
    rownames_to_column("locale") %>% 
    tibble::as_tibble()
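As an aside, the `sapply`/`as.data.frame`/`rename` chain can be collapsed with `tibble::enframe()`, which turns a named vector straight into a two-column tibble (an equivalent sketch; the toy vector stands in for the `sapply` result):

```r
library(tibble)

# In the document this would be sapply(xgbOnevAll, FUN=helperXGBListAccuracyOverall)
acc <- c("Atlanta, GA"=0.902, "Boston, MA"=0.892)
enframe(acc, name="locale", value="accOverall")
```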

# Plot the overall accuracy by locale
ggplot(accList, aes(x=fct_reorder(locale, accOverall), y=accOverall)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(y=accOverall+0.02, label=round(accOverall, 3)), hjust=0) + 
    coord_flip() + 
    labs(x="", y="Overall Accuracy", title="Accuracy of Classifying Locale vs. All Other")

Interestingly, San Antonio stands out as the most distinct of the cities in the analysis. Several of the locales used previously as distinct archetypes - New Orleans, Las Vegas, San Diego - also score reasonably high in this exercise. As expected, the cold-weather cities are the most difficult to distinguish, with classification success in the 80% range (null success 50% given deliberate under-sampling of all-other).

Accuracy can also be assessed as “classifying self correctly” and “classifying all-other correctly”:

# Function to calculate accuracy within each subset (self vs. all-other) from the XGB list
helperXGBListAccuracySubset <- function(lst) {
    
    lst[["testData"]] %>%
        select(binKEY, predicted) %>%
        mutate(correct=binKEY==round(predicted)) %>%
        group_by(binKEY) %>%
        summarize(acc=mean(correct))
    
}

# Extract accuracy by subset by locale
accListSubset <- map_dfr(xgbOnevAll, .f=helperXGBListAccuracySubset, .id="locale") %>% 
    mutate(type=factor(ifelse(binKEY==1, "Classifying Self", "Classifying All Other"), 
                       levels=c("Classifying Self", "Classifying All Other")
                       )
           )

# Plot the accuracies by locale, faceted by self vs. all other
ggplot(accListSubset, aes(x=fct_reorder(locale, acc), y=acc)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(y=acc/2, label=round(acc, 3)), hjust=0) + 
    coord_flip() + 
    labs(x="", y="Accuracy", title="Accuracy of Classifying Locale vs. All Other") + 
    facet_wrap(~type)

Models are generally more successful in classifying self than in classifying all-other. Put differently, the model is more likely to incorrectly flag an All Other locale as Self than it is to incorrectly flag Self as All Other.
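To make this asymmetry concrete, the two facets correspond to accuracy within each true class. A minimal, self-contained sketch with hypothetical predictions (the values here are illustrative only, not model output):

```r
library(dplyr)

# Toy test-set predictions (hypothetical): binKEY 1 = Self, 0 = All Other
toy <- tibble::tibble(
    binKEY    = c(1, 1, 1, 1, 0, 0, 0, 0),
    predicted = c(0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.6, 0.7)
)

# Accuracy within each true class, mirroring helperXGBListAccuracySubset:
# here Self is classified at 0.75 while All Other is only 0.50
toy %>%
    mutate(correct = binKEY == round(predicted)) %>%
    group_by(binKEY) %>%
    summarize(acc = mean(correct))
```

A lower accuracy on the `binKEY == 0` row is exactly the "All Other flagged as Self" error described above.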

Errors in classifying All Other as Self can then be explored further:

# Function to calculate accuracy by true locale (locNamefct) for the all-other subset of the XGB list
helperXGBListErrorByTrueLocation <- function(lst) {
    
    lst[["testData"]] %>%
        select(locNamefct, binKEY, predicted) %>%
        mutate(correct=binKEY==round(predicted)) %>%
        filter(binKEY==0) %>%
        group_by(locNamefct) %>%
        summarize(acc=mean(correct))
    
}

# Extract by locale
accListTrueLocation <- map_dfr(xgbOnevAll, .f=helperXGBListErrorByTrueLocation, .id="modelLocale")

# Plot accuracy
accFactors <- accListTrueLocation %>% 
    group_by(modelLocale) %>% 
    summarize(meanAcc=mean(acc)) %>%
    arrange(meanAcc) %>%
    pull(modelLocale)

accListTrueLocation %>%
    ggplot(aes(x=factor(modelLocale, levels=accFactors), 
               y=factor(locNamefct, levels=accFactors)
               )
           ) + 
    geom_tile(aes(fill=1-acc)) + 
    geom_text(aes(label=paste0(round(100*(1-acc)), "%")), size=3) + 
    scale_x_discrete(position="top") + 
    labs(x="Modeling Locale", 
         y="True Locale", 
         title="Rate of Misclassifying True Locale as Modeling Locale"
         ) + 
    theme(axis.text.x=element_text(angle=90)) + 
    scale_fill_continuous("Error Rate", low="white", high="red")

When Seattle and Denver are just a small part of the deliberately under-sampled All Other class, they are routinely misclassified as being part of many other locales. When they are the full-sampled class, almost no other city is misclassified with any frequency as being them. This suggests that Seattle and Denver may each be distinct, but in a rather nuanced manner that requires significant training data volumes to learn.

Next steps are to explore some of the more challenging one-vs-one classifications suggested by this grid, then to extend the analysis to explore the multi-class classification capabilities of XGB.

Locale pairs that are difficult to classify are identified based on the preceding analysis. There are two types of “difficult to classify”:

  1. Generally difficult - A is often classified as B when the model is trained as B vs. all, while B is often classified as A when the model is trained as A vs. all
  2. Directionally difficult - A is often classified as B when the model is trained as B vs. all, while B is often correctly classified as “not A” when the model is trained as A vs. all
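A minimal sketch of this distinction, using hypothetical directional accuracies (the columns accAB and accBA are illustrative names for the two directions) and the same meanError/deltaError definitions and thresholds (0.5 and 0.2) applied below:

```r
library(dplyr)

# Toy directional accuracies (hypothetical): accAB = accuracy classifying B
# as "not A" under the A-vs-all model; accBA = the reverse direction
toyPairs <- tibble::tibble(
    locA  = c("A1", "A2"),
    locB  = c("B1", "B2"),
    accAB = c(0.30, 0.90),
    accBA = c(0.35, 0.30)
)

# Pair 1 is bad in both directions (generally difficult);
# pair 2 is bad in only one direction (directionally difficult)
toyPairs %>%
    mutate(meanError  = 1 - (accAB + accBA)/2,
           deltaError = abs(accAB - accBA),
           type = case_when(meanError >= 0.5  ~ "Generally difficult",
                            deltaError >= 0.2 ~ "Directionally difficult",
                            TRUE              ~ "Neither"))
```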

The file accListTrueLocation is merged to itself to identify these cases:

# Create the frame of mean error and mean difference in error for A vs all and B vs all
accSummary <- accListTrueLocation %>%
    mutate(x=factor(modelLocale), modelLocale=locNamefct, locNamefct=x) %>%
    select(modelLocale, locNamefct, acc1=acc) %>%
    inner_join(mutate(accListTrueLocation, modelLocale=factor(modelLocale))) %>%
    mutate(meanError=1-(acc1 + acc)/2, deltaError=abs(acc-acc1))
## Joining, by = c("modelLocale", "locNamefct")
# The file contains two (effectively identical) records for each city pair - once as A/B, once as B/A
# Keep only the records where modelLocale is "smaller" than locNamefct
accSummaryUnique <- accSummary %>%
    filter(pmin(as.character(modelLocale), as.character(locNamefct))==as.character(modelLocale))

# Highest mean error
highMeanError <- accSummaryUnique %>%
    arrange(-meanError) %>%
    filter(meanError >= 0.5)
highMeanError
## # A tibble: 29 x 6
##    modelLocale      locNamefct        acc1   acc meanError deltaError
##    <fct>            <fct>            <dbl> <dbl>     <dbl>      <dbl>
##  1 Newark, NJ       Philadelphia, PA 0.108 0.165     0.863     0.0564
##  2 Miami, FL        Tampa Bay, FL    0.231 0.338     0.715     0.106 
##  3 Philadelphia, PA Washington, DC   0.362 0.267     0.685     0.0958
##  4 Green Bay, WI    Madison, WI      0.205 0.466     0.664     0.260 
##  5 Detroit, MI      Grand Rapids, MI 0.348 0.333     0.659     0.0150
##  6 Las Vegas, NV    Phoenix, AZ      0.322 0.366     0.656     0.0440
##  7 Los Angeles, CA  San Diego, CA    0.408 0.294     0.649     0.114 
##  8 Chicago, IL      Milwaukee, WI    0.421 0.281     0.649     0.140 
##  9 Madison, WI      Milwaukee, WI    0.404 0.375     0.610     0.0293
## 10 Chicago, IL      Detroit, MI      0.436 0.349     0.607     0.0862
## # ... with 19 more rows
highMeanError %>% mutate(n=n()) %>% select_if(is.numeric) %>% summarize_all(mean)
## # A tibble: 1 x 5
##    acc1   acc meanError deltaError     n
##   <dbl> <dbl>     <dbl>      <dbl> <dbl>
## 1 0.430 0.388     0.591      0.111    29
# Highest delta error
highDeltaError <- accSummaryUnique %>%
    arrange(-deltaError) %>%
    filter(deltaError >= 0.2)
highDeltaError
## # A tibble: 22 x 6
##    modelLocale      locNamefct       acc1   acc meanError deltaError
##    <fct>            <fct>           <dbl> <dbl>     <dbl>      <dbl>
##  1 New Orleans, LA  Tampa Bay, FL   0.309 0.744     0.474      0.436
##  2 Houston, TX      New Orleans, LA 0.728 0.293     0.490      0.436
##  3 Dallas, TX       San Antonio, TX 0.842 0.474     0.342      0.368
##  4 Houston, TX      Miami, FL       0.763 0.412     0.412      0.351
##  5 San Jose, CA     Seattle, WA     1     0.670     0.165      0.330
##  6 Atlanta, GA      San Antonio, TX 0.938 0.622     0.220      0.316
##  7 Milwaukee, WI    Seattle, WA     0.933 0.617     0.225      0.316
##  8 Grand Rapids, MI Seattle, WA     0.955 0.667     0.189      0.288
##  9 Madison, WI      Seattle, WA     0.935 0.648     0.208      0.287
## 10 Saint Louis, MO  San Antonio, TX 0.862 0.598     0.270      0.264
## # ... with 12 more rows
highDeltaError %>% mutate(n=n()) %>% select_if(is.numeric) %>% summarize_all(mean)
## # A tibble: 1 x 5
##    acc1   acc meanError deltaError     n
##   <dbl> <dbl>     <dbl>      <dbl> <dbl>
## 1 0.716 0.582     0.351      0.274    22

There are 29 locale pairs where the average error rate is worse than a coin flip (mean error for these pairs is roughly 60%). There are 22 locale pairs where the directional error rate differs by at least 20%.

Each of these groupings is run through the prediction process individually, with the goal of seeing how the error rates evolve when the pairs are compared head-to-head. Function helperXGBOnevAll() is modified slightly so that there is no under-sampling when its underSample parameter is passed as FALSE. This is because the one vs. one comparisons are already of roughly equal size, and attempting to under-sample could cause errors due to very small differences in METAR data capture (bad data) by locale.
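The real switch lives inside helperXGBOnevAll() in 'WeatherModelingFunctions_v001.R'. A minimal sketch of how such an underSample flag might behave; the function name balanceClasses and all internals here are hypothetical:

```r
library(dplyr)

# Hypothetical sketch: when underSample is TRUE, down-sample the 'all other'
# class (binKEY == 0) to match the size of the key class (binKEY == 1);
# when FALSE (the one vs. one case), return the data unchanged
balanceClasses <- function(df, underSample = TRUE, seed = NULL) {
    if (!underSample) return(df)
    if (!is.null(seed)) set.seed(seed)
    nKey <- sum(df$binKEY == 1)
    bind_rows(filter(df, binKEY == 1),
              df %>% filter(binKEY == 0) %>% slice_sample(n = nKey))
}
```

With underSample=FALSE the two locales keep their full (and nearly equal) row counts, so tiny differences in data capture cannot cause sampling problems.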

# Create container for each pairing in highMeanError
# Name for modelLocale
highMeanErrorList <- vector("list", nrow(highMeanError))
names(highMeanErrorList) <- highMeanError$modelLocale

# Run model for each pairing in highMeanError
for (n in 1:nrow(highMeanError)) {
    
    # Extract the key locale and the other locale
    # Note that which locale is defined as key is arbitrary and unimportant since this is a full 1:1 comparison
    keyLoc <- as.character(highMeanError$modelLocale)[n]
    otherLoc <- as.character(highMeanError$locNamefct)[n]
    
    # Run XGB for 500 rounds using only two locales and 2016 data; do not under-sample 'all other'
    highMeanErrorList[[n]] <- helperXGBOnevAll(metarData, 
                                               keyLoc=keyLoc, 
                                               critFilter=list(year=2016, 
                                                               locNamefct=c(keyLoc, otherLoc)
                                                               ), 
                                               underSample=FALSE,
                                               seed=2008071346
                                               )
    
}
## 
## 
##  **************
## Running for Newark, NJ with decription Newark 
## [1]  train-error:0.411434 
## [101]    train-error:0.142880 
## [201]    train-error:0.064427 
## [301]    train-error:0.032376 
## [401]    train-error:0.012641 
## [500]    train-error:0.004241 
## Finished processing Newark, NJ with overall test set accuracy 0.696 
## 
## 
##  **************
## Running for Miami, FL with decription Miami 
## [1]  train-error:0.257245 
## [101]    train-error:0.072822 
## [201]    train-error:0.033880 
## [301]    train-error:0.014287 
## [401]    train-error:0.007021 
## [500]    train-error:0.002449 
## Finished processing Miami, FL with overall test set accuracy 0.854 
## 
## 
##  **************
## Running for Philadelphia, PA with decription Philadelphia 
## [1]  train-error:0.365027 
## [101]    train-error:0.105720 
## [201]    train-error:0.046068 
## [301]    train-error:0.020538 
## [401]    train-error:0.006628 
## [500]    train-error:0.002046 
## Finished processing Philadelphia, PA with overall test set accuracy 0.767 
## 
## 
##  **************
## Running for Green Bay, WI with decription Green Bay 
## [1]  train-error:0.375268 
## [101]    train-error:0.087229 
## [201]    train-error:0.033454 
## [301]    train-error:0.013630 
## [401]    train-error:0.004791 
## [500]    train-error:0.001817 
## Finished processing Green Bay, WI with overall test set accuracy 0.793 
## 
## 
##  **************
## Running for Detroit, MI with decription Detroit 
## [1]  train-error:0.410259 
## [101]    train-error:0.115100 
## [201]    train-error:0.050469 
## [301]    train-error:0.025029 
## [401]    train-error:0.009633 
## [500]    train-error:0.002717 
## Finished processing Detroit, MI with overall test set accuracy 0.751 
## 
## 
##  **************
## Running for Las Vegas, NV with decription Las Vegas 
## [1]  train-error:0.229201 
## [101]    train-error:0.020554 
## [201]    train-error:0.003685 
## [301]    train-error:0.000491 
## [401]    train-error:0.000082 
## [500]    train-error:0.000000 
## Finished processing Las Vegas, NV with overall test set accuracy 0.948 
## 
## 
##  **************
## Running for Los Angeles, CA with decription Los Angeles 
## [1]  train-error:0.246527 
## [101]    train-error:0.055646 
## [201]    train-error:0.023206 
## [301]    train-error:0.008416 
## [401]    train-error:0.002942 
## [500]    train-error:0.000735 
## Finished processing Los Angeles, CA with overall test set accuracy 0.876 
## 
## 
##  **************
## Running for Chicago, IL with decription Chicago 
## [1]  train-error:0.424537 
## [101]    train-error:0.124561 
## [201]    train-error:0.055179 
## [301]    train-error:0.024325 
## [401]    train-error:0.010040 
## [500]    train-error:0.004245 
## Finished processing Chicago, IL with overall test set accuracy 0.709 
## 
## 
##  **************
## Running for Madison, WI with decription Madison 
## [1]  train-error:0.366549 
## [101]    train-error:0.110187 
## [201]    train-error:0.048835 
## [301]    train-error:0.016635 
## [401]    train-error:0.005435 
## [500]    train-error:0.001729 
## Finished processing Madison, WI with overall test set accuracy 0.757 
## 
## 
##  **************
## Running for Chicago, IL with decription Chicago 
## [1]  train-error:0.381528 
## [101]    train-error:0.093339 
## [201]    train-error:0.044626 
## [301]    train-error:0.017736 
## [401]    train-error:0.004904 
## [500]    train-error:0.000981 
## Finished processing Chicago, IL with overall test set accuracy 0.783 
## 
## 
##  **************
## Running for Newark, NJ with decription Newark 
## [1]  train-error:0.358078 
## [101]    train-error:0.082665 
## [201]    train-error:0.034457 
## [301]    train-error:0.011049 
## [401]    train-error:0.004092 
## [500]    train-error:0.000900 
## Finished processing Newark, NJ with overall test set accuracy 0.82 
## 
## 
##  **************
## Running for Madison, WI with decription Madison 
## [1]  train-error:0.348650 
## [101]    train-error:0.084074 
## [201]    train-error:0.030138 
## [301]    train-error:0.009058 
## [401]    train-error:0.003870 
## [500]    train-error:0.001070 
## Finished processing Madison, WI with overall test set accuracy 0.829 
## 
## 
##  **************
## Running for Boston, MA with decription Boston 
## [1]  train-error:0.355678 
## [101]    train-error:0.077781 
## [201]    train-error:0.029291 
## [301]    train-error:0.010010 
## [401]    train-error:0.002543 
## [500]    train-error:0.000820 
## Finished processing Boston, MA with overall test set accuracy 0.816 
## 
## 
##  **************
## Running for Chicago, IL with decription Chicago 
## [1]  train-error:0.394594 
## [101]    train-error:0.109842 
## [201]    train-error:0.046172 
## [301]    train-error:0.017992 
## [401]    train-error:0.007148 
## [500]    train-error:0.002136 
## Finished processing Chicago, IL with overall test set accuracy 0.763 
## 
## 
##  **************
## Running for Green Bay, WI with decription Green Bay 
## [1]  train-error:0.367822 
## [101]    train-error:0.090440 
## [201]    train-error:0.036782 
## [301]    train-error:0.013025 
## [401]    train-error:0.003932 
## [500]    train-error:0.001720 
## Finished processing Green Bay, WI with overall test set accuracy 0.794 
## 
## 
##  **************
## Running for Detroit, MI with decription Detroit 
## [1]  train-error:0.373303 
## [101]    train-error:0.096025 
## [201]    train-error:0.041714 
## [301]    train-error:0.017177 
## [401]    train-error:0.007280 
## [500]    train-error:0.002454 
## Finished processing Detroit, MI with overall test set accuracy 0.815 
## 
## 
##  **************
## Running for Grand Rapids, MI with decription Grand Rapids 
## [1]  train-error:0.343531 
## [101]    train-error:0.079410 
## [201]    train-error:0.028037 
## [301]    train-error:0.007916 
## [401]    train-error:0.001484 
## [500]    train-error:0.000082 
## Finished processing Grand Rapids, MI with overall test set accuracy 0.827 
## 
## 
##  **************
## Running for Madison, WI with decription Madison 
## [1]  train-error:0.362537 
## [101]    train-error:0.096413 
## [201]    train-error:0.042530 
## [301]    train-error:0.014314 
## [401]    train-error:0.005512 
## [500]    train-error:0.001645 
## Finished processing Madison, WI with overall test set accuracy 0.797 
## 
## 
##  **************
## Running for Chicago, IL with decription Chicago 
## [1]  train-error:0.340219 
## [101]    train-error:0.092832 
## [201]    train-error:0.039832 
## [301]    train-error:0.016542 
## [401]    train-error:0.004691 
## [500]    train-error:0.001234 
## Finished processing Chicago, IL with overall test set accuracy 0.778 
## 
## 
##  **************
## Running for Chicago, IL with decription Chicago 
## [1]  train-error:0.396990 
## [101]    train-error:0.095281 
## [201]    train-error:0.035986 
## [301]    train-error:0.015621 
## [401]    train-error:0.005398 
## [500]    train-error:0.001636 
## Finished processing Chicago, IL with overall test set accuracy 0.815 
## 
## 
##  **************
## Running for Grand Rapids, MI with decription Grand Rapids 
## [1]  train-error:0.302754 
## [101]    train-error:0.085491 
## [201]    train-error:0.033457 
## [301]    train-error:0.011262 
## [401]    train-error:0.004768 
## [500]    train-error:0.001069 
## Finished processing Grand Rapids, MI with overall test set accuracy 0.828 
## 
## 
##  **************
## Running for Grand Rapids, MI with decription Grand Rapids 
## [1]  train-error:0.367164 
## [101]    train-error:0.066233 
## [201]    train-error:0.019689 
## [301]    train-error:0.006014 
## [401]    train-error:0.001236 
## [500]    train-error:0.000082 
## Finished processing Grand Rapids, MI with overall test set accuracy 0.854 
## 
## 
##  **************
## Running for Boston, MA with decription Boston 
## [1]  train-error:0.341920 
## [101]    train-error:0.072518 
## [201]    train-error:0.026333 
## [301]    train-error:0.010090 
## [401]    train-error:0.002543 
## [500]    train-error:0.000246 
## Finished processing Boston, MA with overall test set accuracy 0.845 
## 
## 
##  **************
## Running for Houston, TX with decription Houston 
## [1]  train-error:0.273485 
## [101]    train-error:0.053341 
## [201]    train-error:0.021157 
## [301]    train-error:0.006372 
## [401]    train-error:0.002287 
## [500]    train-error:0.000327 
## Finished processing Houston, TX with overall test set accuracy 0.882 
## 
## 
##  **************
## Running for Detroit, MI with decription Detroit 
## [1]  train-error:0.348135 
## [101]    train-error:0.070412 
## [201]    train-error:0.025352 
## [301]    train-error:0.006624 
## [401]    train-error:0.001717 
## [500]    train-error:0.000327 
## Finished processing Detroit, MI with overall test set accuracy 0.845 
## 
## 
##  **************
## Running for Grand Rapids, MI with decription Grand Rapids 
## [1]  train-error:0.368565 
## [101]    train-error:0.095416 
## [201]    train-error:0.040040 
## [301]    train-error:0.013844 
## [401]    train-error:0.005305 
## [500]    train-error:0.001492 
## Finished processing Grand Rapids, MI with overall test set accuracy 0.798 
## 
## 
##  **************
## Running for Detroit, MI with decription Detroit 
## [1]  train-error:0.362400 
## [101]    train-error:0.073103 
## [201]    train-error:0.027618 
## [301]    train-error:0.008933 
## [401]    train-error:0.002868 
## [500]    train-error:0.000574 
## Finished processing Detroit, MI with overall test set accuracy 0.831 
## 
## 
##  **************
## Running for Milwaukee, WI with decription Milwaukee 
## [1]  train-error:0.407719 
## [101]    train-error:0.082409 
## [201]    train-error:0.028394 
## [301]    train-error:0.011668 
## [401]    train-error:0.003753 
## [500]    train-error:0.000653 
## Finished processing Milwaukee, WI with overall test set accuracy 0.837 
## 
## 
##  **************
## Running for Grand Rapids, MI with decription Grand Rapids 
## [1]  train-error:0.403437 
## [101]    train-error:0.116090 
## [201]    train-error:0.044808 
## [301]    train-error:0.017101 
## [401]    train-error:0.007564 
## [500]    train-error:0.002713 
## Finished processing Grand Rapids, MI with overall test set accuracy 0.772

The error rates are then extracted, overall and by sub-group:

# Helper function to extract one vs one accuracy
helperExtractOnevOneAccuracy <- function(lst) {
    
    # Create accuracy for each sub-group and overall
    rawSummary <- lst[["testData"]] %>%
        count(locNamefct, correct=round(predicted)==binKEY) %>%
        mutate(locNamefct=as.character(locNamefct)) %>%
        bind_rows(mutate(., locNamefct="Overall")) %>%
        group_by(locNamefct) %>%
        summarize(nTotal=sum(n), nCorrect=sum(n*(correct==TRUE))) %>%
        ungroup() %>%
        mutate(acc=nCorrect/nTotal)
    
    # Create one-row tibble with locA, locB, accA, accB, accOverall
    locs <- rawSummary$locNamefct %>% setdiff("Overall")
    tibble::tibble(locA=min(locs), 
                   locB=max(locs), 
                   accA=rawSummary %>% filter(locNamefct==locA) %>% pull(acc),
                   accB=rawSummary %>% filter(locNamefct==locB) %>% pull(acc), 
                   accOverall=rawSummary %>% filter(locNamefct=="Overall") %>% pull(acc)
                   )
    
}

# Extract accuracy by subset by locale
accHighMeanError <- map_dfr(highMeanErrorList, .f=helperExtractOnevOneAccuracy)

# Combine with the original file, highMeanError
highMeanOutput <- highMeanError %>%
    mutate(locA=as.character(modelLocale), 
           locB=as.character(locNamefct), 
           original=1-meanError
           ) %>%
    select(locA, locB, original) %>%
    inner_join(accHighMeanError, by=c("locA", "locB")) %>%
    select(-accA, -accB, new=accOverall) %>%
    pivot_longer(-c(locA, locB), names_to="model", values_to="accuracy") %>%
    mutate(desc=paste0(locA, " vs. ", locB)) 

# Plot the accuracy data
highMeanOutput %>%
    ggplot(aes(x=fct_reorder(desc, accuracy), y=accuracy, color=model)) + 
    geom_point() + 
    geom_text(aes(y=accuracy-0.02, label=round(accuracy, 2)), hjust=1, size=4) +
    coord_flip() + 
    labs(x="", y="Accuracy", title="Change in Accuracy - Original One vs. All, New One vs, One") + 
    geom_hline(aes(yintercept=0.5), lty=2) + 
    ylim(c(0, 1))

# Plot the change in accuracy data
highMeanOutput %>%
    group_by(desc) %>%
    summarize(accuracyGain=max(accuracy)-min(accuracy)) %>%
    ggplot(aes(x=fct_reorder(desc, accuracyGain), y=accuracyGain)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(y=accuracyGain+0.02, label=round(accuracyGain, 2)), hjust=0) + 
    coord_flip() + 
    labs(x="", y="Gain in Accuracy", title="Gain in Accuracy (One vs One compared to One vs. All")

# Check the highest delta error records
accHighMeanError %>%
    mutate(deltaAccuracy=abs(accA-accB)) %>%
    arrange(-deltaAccuracy)
## # A tibble: 29 x 6
##    locA             locB               accA  accB accOverall deltaAccuracy
##    <chr>            <chr>             <dbl> <dbl>      <dbl>         <dbl>
##  1 Madison, WI      Traverse City, MI 0.809 0.849      0.829        0.0407
##  2 Chicago, IL      Madison, WI       0.761 0.795      0.778        0.0341
##  3 Milwaukee, WI    Minneapolis, MN   0.851 0.823      0.837        0.0282
##  4 Philadelphia, PA Washington, DC    0.755 0.779      0.767        0.0232
##  5 Detroit, MI      Traverse City, MI 0.835 0.856      0.845        0.0215
##  6 Newark, NJ       Philadelphia, PA  0.686 0.706      0.696        0.0198
##  7 Grand Rapids, MI Madison, WI       0.789 0.807      0.798        0.0181
##  8 Green Bay, WI    Milwaukee, WI     0.786 0.803      0.794        0.0177
##  9 Chicago, IL      Grand Rapids, MI  0.771 0.754      0.763        0.0164
## 10 Chicago, IL      Indianapolis, IN  0.808 0.823      0.815        0.0155
## # ... with 19 more rows

Accuracy gains are significant, with many comparisons gaining ~40 percentage points of accuracy when trained one vs. one rather than as just a small component of a one vs. all training. Gains are especially notable for Las Vegas/Phoenix, which soars from 36% accuracy (worse than null) to 95% accuracy. Large gains are also noted for Miami/Tampa and Los Angeles/San Diego, suggesting that focused one vs. one training can help pull apart similar archetype cities.

That said, there remain many pairings where the differentiation appears adequate (accuracy around 80%, against a null accuracy of 50%) but still shows meaningful confusion.

There is little delta error whether looking at A/B or B/A, which is not surprising given the balanced nature of the modeling data and the generally small delta error for these pairings even in the previous one vs. all runs.

The process is run again for the high delta-error pairings:

# Create container for each pairing in highDeltaError
# Name for modelLocale
highDeltaErrorList <- vector("list", nrow(highDeltaError))
names(highDeltaErrorList) <- highDeltaError$modelLocale

# Run model for each pairing in highDeltaError
for (n in 1:nrow(highDeltaError)) {
    
    # Extract the key locale and the other locale
    # Note that which locale is defined as key is arbitrary and unimportant since this is a full 1:1 comparison
    keyLoc <- as.character(highDeltaError$modelLocale)[n]
    otherLoc <- as.character(highDeltaError$locNamefct)[n]
    
    # Run XGB for 500 rounds using only two locales and 2016 data; do not under-sample 'all other'
    highDeltaErrorList[[n]] <- helperXGBOnevAll(metarData, 
                                                keyLoc=keyLoc, 
                                                critFilter=list(year=2016, 
                                                                locNamefct=c(keyLoc, otherLoc)
                                                                ), 
                                                underSample=FALSE,
                                                seed=2008071439
                                                )
    
}
## 
## 
##  **************
## Running for New Orleans, LA with decription New Orleans 
## [1]  train-error:0.305934 
## [101]    train-error:0.007754 
## [201]    train-error:0.001551 
## [301]    train-error:0.000082 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing New Orleans, LA with overall test set accuracy 0.976 
## 
## 
##  **************
## Running for Houston, TX with decription Houston 
## [1]  train-error:0.350674 
## [101]    train-error:0.016660 
## [201]    train-error:0.003838 
## [301]    train-error:0.000898 
## [401]    train-error:0.000163 
## [500]    train-error:0.000000 
## Finished processing Houston, TX with overall test set accuracy 0.95 
## 
## 
##  **************
## Running for Dallas, TX with decription Dallas 
## [1]  train-error:0.302778 
## [101]    train-error:0.000408 
## [201]    train-error:0.000000 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Dallas, TX with overall test set accuracy 0.995 
## 
## 
##  **************
## Running for Houston, TX with decription Houston 
## [1]  train-error:0.220534 
## [101]    train-error:0.033407 
## [201]    train-error:0.011027 
## [301]    train-error:0.003185 
## [401]    train-error:0.000490 
## [500]    train-error:0.000082 
## Finished processing Houston, TX with overall test set accuracy 0.923 
## 
## 
##  **************
## Running for San Jose, CA with decription San Jose 
## [1]  train-error:0.172721 
## [101]    train-error:0.000082 
## [201]    train-error:0.000000 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing San Jose, CA with overall test set accuracy 0.996 
## 
## 
##  **************
## Running for Atlanta, GA with decription Atlanta 
## [1]  train-error:0.226776 
## [101]    train-error:0.000327 
## [201]    train-error:0.000000 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Atlanta, GA with overall test set accuracy 0.996 
## 
## 
##  **************
## Running for Milwaukee, WI with decription Milwaukee 
## [1]  train-error:0.185346 
## [101]    train-error:0.002859 
## [201]    train-error:0.000000 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Milwaukee, WI with overall test set accuracy 0.985 
## 
## 
##  **************
## Running for Grand Rapids, MI with decription Grand Rapids 
## [1]  train-error:0.196185 
## [101]    train-error:0.002302 
## [201]    train-error:0.000000 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Grand Rapids, MI with overall test set accuracy 0.984 
## 
## 
##  **************
## Running for Madison, WI with decription Madison 
## [1]  train-error:0.177730 
## [101]    train-error:0.003871 
## [201]    train-error:0.000082 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Madison, WI with overall test set accuracy 0.982 
## 
## 
##  **************
## Running for Saint Louis, MO with decription Saint Louis 
## [1]  train-error:0.279400 
## [101]    train-error:0.001312 
## [201]    train-error:0.000000 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Saint Louis, MO with overall test set accuracy 0.99 
## 
## 
##  **************
## Running for Green Bay, WI with decription Green Bay 
## [1]  train-error:0.372873 
## [101]    train-error:0.092268 
## [201]    train-error:0.035602 
## [301]    train-error:0.013051 
## [401]    train-error:0.003635 
## [500]    train-error:0.000991 
## Finished processing Green Bay, WI with overall test set accuracy 0.791 
## 
## 
##  **************
## Running for Grand Rapids, MI with decription Grand Rapids 
## [1]  train-error:0.341717 
## [101]    train-error:0.077925 
## [201]    train-error:0.023501 
## [301]    train-error:0.007751 
## [401]    train-error:0.002391 
## [500]    train-error:0.000412 
## Finished processing Grand Rapids, MI with overall test set accuracy 0.826 
## 
## 
##  **************
## Running for Atlanta, GA with decription Atlanta 
## [1]  train-error:0.285691 
## [101]    train-error:0.020699 
## [201]    train-error:0.002863 
## [301]    train-error:0.000164 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Atlanta, GA with overall test set accuracy 0.937 
## 
## 
##  **************
## Running for Chicago, IL with decription Chicago 
## [1]  train-error:0.296744 
## [101]    train-error:0.060230 
## [201]    train-error:0.016404 
## [301]    train-error:0.003754 
## [401]    train-error:0.000979 
## [500]    train-error:0.000082 
## Finished processing Chicago, IL with overall test set accuracy 0.862 
## 
## 
##  **************
## Running for Green Bay, WI with decription Green Bay 
## [1]  train-error:0.154924 
## [101]    train-error:0.006063 
## [201]    train-error:0.000164 
## [301]    train-error:0.000000 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Green Bay, WI with overall test set accuracy 0.975 
## 
## 
##  **************
## Running for Milwaukee, WI with decription Milwaukee 
## [1]  train-error:0.293287 
## [101]    train-error:0.070974 
## [201]    train-error:0.025237 
## [301]    train-error:0.008821 
## [401]    train-error:0.001797 
## [500]    train-error:0.000245 
## Finished processing Milwaukee, WI with overall test set accuracy 0.835 
## 
## 
##  **************
## Running for Atlanta, GA with decription Atlanta 
## [1]  train-error:0.324559 
## [101]    train-error:0.046494 
## [201]    train-error:0.014432 
## [301]    train-error:0.004100 
## [401]    train-error:0.001148 
## [500]    train-error:0.000410 
## Finished processing Atlanta, GA with overall test set accuracy 0.892 
## 
## 
##  **************
## Running for Indianapolis, IN with decription Indianapolis 
## [1]  train-error:0.343717 
## [101]    train-error:0.045049 
## [201]    train-error:0.011201 
## [301]    train-error:0.001308 
## [401]    train-error:0.000245 
## [500]    train-error:0.000000 
## Finished processing Indianapolis, IN with overall test set accuracy 0.896 
## 
## 
##  **************
## Running for Detroit, MI with decription Detroit 
## [1]  train-error:0.325319 
## [101]    train-error:0.077363 
## [201]    train-error:0.027723 
## [301]    train-error:0.008996 
## [401]    train-error:0.002290 
## [500]    train-error:0.000491 
## Finished processing Detroit, MI with overall test set accuracy 0.843 
## 
## 
##  **************
## Running for Miami, FL with decription Miami 
## [1]  train-error:0.265100 
## [101]    train-error:0.009631 
## [201]    train-error:0.002367 
## [301]    train-error:0.000163 
## [401]    train-error:0.000000 
## [500]    train-error:0.000000 
## Finished processing Miami, FL with overall test set accuracy 0.967 
## 
## 
##  **************
## Running for San Diego, CA with description San Diego 
## [1]  train-error:0.238916 
## [101]    train-error:0.021801 
## [201]    train-error:0.004573 
## [301]    train-error:0.000653 
## [401]    train-error:0.000082 
## [500]    train-error:0.000000 
## Finished processing San Diego, CA with overall test set accuracy 0.94 
## 
## 
##  **************
## Running for Boston, MA with description Boston 
## [1]  train-error:0.328260 
## [101]    train-error:0.032003 
## [201]    train-error:0.005594 
## [301]    train-error:0.001070 
## [401]    train-error:0.000082 
## [500]    train-error:0.000000 
## Finished processing Boston, MA with overall test set accuracy 0.949

The comparisons can then be rerun:

# Extract one vs. one accuracy by locale pairing
accHighDeltaError <- map_dfr(highDeltaErrorList, .f=helperExtractOnevOneAccuracy)

# Combine with the original results in highDeltaError
highDeltaOutput <- highDeltaError %>%
    mutate(locA=as.character(modelLocale), 
           locB=as.character(locNamefct), 
           original=1-meanError
           ) %>%
    select(locA, locB, original) %>%
    inner_join(accHighDeltaError, by=c("locA", "locB")) %>%
    select(-accA, -accB, new=accOverall) %>%
    pivot_longer(-c(locA, locB), names_to="model", values_to="accuracy") %>%
    mutate(desc=paste0(locA, " vs. ", locB)) 

# Plot the accuracy data
highDeltaOutput %>%
    ggplot(aes(x=fct_reorder(desc, accuracy), y=accuracy, color=model)) + 
    geom_point() + 
    geom_text(aes(y=accuracy-0.02, label=round(accuracy, 2)), hjust=1, size=4) +
    coord_flip() + 
    labs(x="", y="Accuracy", title="Change in Accuracy - Original One vs. All, New One vs. One") + 
    geom_hline(aes(yintercept=0.5), lty=2) + 
    ylim(c(0, 1))

# Plot the change in accuracy data
highDeltaOutput %>%
    group_by(desc) %>%
    summarize(accuracyGain=max(accuracy)-min(accuracy)) %>%
    ggplot(aes(x=fct_reorder(desc, accuracyGain), y=accuracyGain)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(y=accuracyGain+0.02, label=round(accuracyGain, 2)), hjust=0) + 
    coord_flip() + 
    labs(x="", y="Gain in Accuracy", title="Gain in Accuracy (One vs. One Compared to One vs. All)")

# Check the highest delta error records
accHighDeltaError %>%
    mutate(deltaAccuracy=abs(accA-accB)) %>%
    arrange(-deltaAccuracy)
## # A tibble: 22 x 6
##    locA             locB               accA  accB accOverall deltaAccuracy
##    <chr>            <chr>             <dbl> <dbl>      <dbl>         <dbl>
##  1 Indianapolis, IN Minneapolis, MN   0.882 0.911      0.896        0.0293
##  2 Atlanta, GA      Saint Louis, MO   0.904 0.879      0.892        0.0251
##  3 Chicago, IL      Traverse City, MI 0.851 0.873      0.862        0.0221
##  4 Detroit, MI      Traverse City, MI 0.833 0.853      0.843        0.0199
##  5 Milwaukee, WI    Traverse City, MI 0.826 0.845      0.835        0.0189
##  6 Houston, TX      New Orleans, LA   0.944 0.957      0.950        0.0132
##  7 Green Bay, WI    Madison, WI       0.796 0.785      0.791        0.0112
##  8 Grand Rapids, MI Seattle, WA       0.978 0.989      0.984        0.0112
##  9 Green Bay, WI    Seattle, WA       0.969 0.980      0.975        0.0108
## 10 Boston, MA       Indianapolis, IN  0.954 0.943      0.949        0.0104
## # ... with 12 more rows

There is no longer any meaningful difference in the A/B and B/A accuracy for these pairings, and the overall accuracy has in almost all cases climbed into the 90%+ range. This suggests that these pairings are largely distinct but need a large training sample to tease out the distinctions. The one exception is Detroit vs. Traverse City, which is notably the only pairing that was also part of the “low overall accuracy” runs.

Next steps are to explore what predictions look like for unrelated cities using the one vs. one models. This may induce significant extrapolation errors, and is intended solely as an exploratory step. The next main model will be the multiclass version of the XGB, taking care to have sufficient training data volume in an attempt to meaningfully capture the accuracy gains seen in the preceding analysis.

Converting the XGB approach to work for multiclass classification will require the following:

  1. Convert the target variable to an integer index running from 0 to n-1 (where there are n classes); note that the binary case (n=2) already works this way, since the XGB algorithm uses 0 and 1 there
  2. Convert the objective to “multi:softprob”
  3. Convert the evaluation metric (though this should happen automatically)
  4. Convert the predictions back to a normal value
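The label handling in steps 1 and 4 can be sketched in isolation. This is an illustrative sketch, not the implementation inside xgbRunModel(); all of the names below (yFactor, yLevels, probMat, and the toy probability values) are made up for the example.

```r
# Step 1: a factor target becomes a 0-based integer index; the factor
# levels are kept so predictions can be mapped back later
yFactor <- factor(c("Chicago, IL", "Las Vegas, NV", "Chicago, IL", "San Diego, CA"))
yLevels <- levels(yFactor)
yInt    <- as.integer(yFactor) - 1L   # 0 .. n-1

# Steps 2/3: set the multiclass objective and class count; with
# "multi:softprob" the evaluation metric defaults to merror
params <- list(objective = "multi:softprob", num_class = length(yLevels))

# Step 4: softprob predictions arrive as one probability per class per
# row; take the most probable class and map back through the levels
probVec <- c(0.70, 0.20, 0.10,   # toy row 1
             0.10, 0.80, 0.10)   # toy row 2
probMat <- matrix(probVec, ncol = length(yLevels), byrow = TRUE)
predicted <- yLevels[apply(probMat, 1, which.max)]
```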

The function xgbRunModel() is updated to accept xgbObjective=“multi:softprob”, and then data are created for the four locales with 2014-2019 data - Chicago, Las Vegas, New Orleans, San Diego:

fourLocales <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

xgbFourLocalesCV <- xgbRunModel(filter(metarData, !is.na(TempF)), 
                                depVar="locNamefct", 
                                predVars=locXGBPreds, 
                                otherVars=keepVarFull, 
                                critFilter=list(year=2016, locNamefct=fourLocales), 
                                seed=2008081315,
                                nrounds=1000,
                                print_every_n=50,
                                xgbObjective="multi:softprob", 
                                funcRun=xgboost::xgb.cv,
                                nfold=5,
                                calcErr=FALSE,
                                num_class=length(fourLocales)
                                )
## [1]  train-merror:0.178692+0.001724  test-merror:0.191964+0.006761 
## [51] train-merror:0.019284+0.000676  test-merror:0.047940+0.003303 
## [101]    train-merror:0.003491+0.000273  test-merror:0.033607+0.002575 
## [151]    train-merror:0.000326+0.000083  test-merror:0.027727+0.002249 
## [201]    train-merror:0.000031+0.000025  test-merror:0.024909+0.001959 
## [251]    train-merror:0.000000+0.000000  test-merror:0.023643+0.001509 
## [301]    train-merror:0.000000+0.000000  test-merror:0.023031+0.001820 
## [351]    train-merror:0.000000+0.000000  test-merror:0.022541+0.001797 
## [401]    train-merror:0.000000+0.000000  test-merror:0.021847+0.001639 
## [451]    train-merror:0.000000+0.000000  test-merror:0.021642+0.001794 
## [501]    train-merror:0.000000+0.000000  test-merror:0.021234+0.001608 
## [551]    train-merror:0.000000+0.000000  test-merror:0.021071+0.001730 
## [601]    train-merror:0.000000+0.000000  test-merror:0.020948+0.001783 
## [651]    train-merror:0.000000+0.000000  test-merror:0.020621+0.001766 
## [701]    train-merror:0.000000+0.000000  test-merror:0.020744+0.001442 
## [751]    train-merror:0.000000+0.000000  test-merror:0.020581+0.001503 
## [801]    train-merror:0.000000+0.000000  test-merror:0.020458+0.001589 
## [851]    train-merror:0.000000+0.000000  test-merror:0.020377+0.001651 
## [901]    train-merror:0.000000+0.000000  test-merror:0.020621+0.001566 
## [951]    train-merror:0.000000+0.000000  test-merror:0.020662+0.001351 
## [1000]   train-merror:0.000000+0.000000  test-merror:0.020581+0.001313

Error evolution can then be plotted:

minTestError <- xgbFourLocalesCV$xgbModel$evaluation_log %>%
    pull(test_merror_mean) %>%
    min()

xgbFourLocalesCV$xgbModel$evaluation_log %>%
    filter(test_merror_mean <= 0.08) %>%
    ggplot(aes(x=iter, y=test_merror_mean)) +
    geom_line() +
    labs(x="# Iterations", 
         y="Test Error", 
         title="Evolution of Test Error", 
         subtitle="Filtered to Test Error <= 0.08"
         ) + 
    ylim(c(0, NA)) + 
    geom_vline(aes(xintercept=0), lty=2) + 
    geom_hline(aes(yintercept=minTestError), color="red", lty=2)

Test error appears to be near the minimum at around 750 iterations. The xgboost::xgboost algorithm is run with 750 iterations, and predictions are made for the most probable locale for each test observation:
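The “around 750” read of the plot can also be checked programmatically against the CV log. A minimal sketch, using a toy evaluation_log with the same column names xgb.cv produces (the values here are illustrative, not the actual run):

```r
# Toy stand-in for xgbModel$evaluation_log from xgb.cv
evalLog <- data.frame(
    iter             = c(250, 500, 750, 1000),
    test_merror_mean = c(0.0236, 0.0212, 0.0206, 0.0206),
    test_merror_std  = c(0.0015, 0.0016, 0.0015, 0.0013)
)

# Iteration with the lowest mean CV error
idxMin   <- which.min(evalLog$test_merror_mean)
bestIter <- evalLog$iter[idxMin]

# A common heuristic: the first iteration within one CV standard
# deviation of the minimum, favoring a smaller model
cutoff    <- evalLog$test_merror_mean[idxMin] + evalLog$test_merror_std[idxMin]
earlyIter <- min(evalLog$iter[evalLog$test_merror_mean <= cutoff])
```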

fourLocales <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

xgbFourLocales <- xgbRunModel(filter(metarData, !is.na(TempF)), 
                              depVar="locNamefct", 
                              predVars=locXGBPreds, 
                              otherVars=keepVarFull, 
                              critFilter=list(year=2016, locNamefct=fourLocales), 
                              seed=2008081315,
                              nrounds=750,
                              print_every_n=50,
                              xgbObjective="multi:softprob", 
                              funcRun=xgboost::xgboost,
                              calcErr=FALSE,
                              num_class=length(fourLocales)
                              )
## [1]  train-merror:0.180081 
## [51] train-merror:0.021112 
## [101]    train-merror:0.004614 
## [151]    train-merror:0.000449 
## [201]    train-merror:0.000041 
## [251]    train-merror:0.000000 
## [301]    train-merror:0.000000 
## [351]    train-merror:0.000000 
## [401]    train-merror:0.000000 
## [451]    train-merror:0.000000 
## [501]    train-merror:0.000000 
## [551]    train-merror:0.000000 
## [601]    train-merror:0.000000 
## [651]    train-merror:0.000000 
## [701]    train-merror:0.000000 
## [750]    train-merror:0.000000
## Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
## This warning is displayed once per session.

Classification performance on test data can then be assessed:

# Overall success
xgbFourLocales$testData %>%
    summarize(mean(locNamefct==predicted))
## # A tibble: 1 x 1
##   `mean(locNamefct == predicted)`
##                             <dbl>
## 1                           0.985
# Histogram of predicted probabilities for the predicted class
xgbFourLocales$testData %>%
    mutate(correct=locNamefct==predicted) %>%
    ggplot() + 
    stat_count(aes(x=round(probPredicted, 2), y=..prop.., group=correct)) + 
    facet_wrap(~correct) + 
    labs(x="Probability Given to Prediction", 
         y="Proportion of Predictions", 
         title="Probability of Prediction vs. Accuracy of Prediction"
         )

# Confusion matrix
xgbFourLocales$testData %>%
    count(locNamefct, predicted) %>%
    group_by(locNamefct) %>%
    mutate(pct=n/sum(n)) %>%
    ggplot(aes(x=predicted, y=locNamefct)) + 
    geom_tile(aes(fill=pct)) + 
    geom_text(aes(label=paste0(round(100*pct), "%"))) + 
    scale_fill_continuous("", low="white", high="green") + 
    labs(title="Predicted vs. Actual Locale Frequency", y="Actual Locale", x="Predicted Locale")

# Find and plot importances
xgb_fourLocales_importance <- plotXGBImportance(xgbFourLocales, 
                                                featureStems=locXGBPreds, 
                                                stemMapper = varMapper, 
                                                plotTitle="Gain by variable in xgboost", 
                                                plotSubtitle="Modeling 2016 Locale (LAS, MSY, ORD, SAN)"
                                                )
## # A tibble: 10 x 4
##    Feature          Gain  Cover Frequency
##    <chr>           <dbl>  <dbl>     <dbl>
##  1 month         0.143   0.187     0.105 
##  2 hrfct         0.00824 0.0563    0.0561
##  3 TempF         0.195   0.132     0.129 
##  4 DewF          0.303   0.147     0.139 
##  5 modSLP        0.114   0.164     0.206 
##  6 Altimeter     0.0635  0.104     0.161 
##  7 WindSpeed     0.0523  0.0549    0.0856
##  8 predomDir     0.0441  0.0733    0.0513
##  9 minHeight     0.0668  0.0619    0.0418
## 10 ceilingHeight 0.00958 0.0192    0.0247

Overall prediction accuracy is ~98%, with incorrect predictions having much lower probabilities than correct predictions. Predictions for every locale appear to be of about equal (and very high) accuracy. Dew point stands out as the best differentiator, with temperature, month, and sea-level pressure next highest in gain. This is consistent with previous analysis of these locales.

Overall, the base XGB parameters seem to be driving increased multiclass accuracy in a much shorter run time than random forest.

How well does the 2016 model predict data from 2014-2015 and 2017-2019 for these same locales?

# Extract data for the four locales, excluding 2016
non2016FourLocalesData <- metarData %>%
    filter(!is.na(TempF), year != 2016, locNamefct %in% fourLocales)

# Make the predictions
non2016FourLocalesPred <- non2016FourLocalesData %>%
    mutate_if(is.factor, .funs=fct_drop) %>%
    helperMakeSparse(predVars=locXGBPreds) %>%
    predict(xgbFourLocales$xgbModel, newdata=.)

# Create the prediction matrix
non2016FourLocalesMatrix <- matrix(data=non2016FourLocalesPred, 
                                   ncol=length(fourLocales), 
                                   nrow=nrow(non2016FourLocalesData), 
                                   byrow=TRUE
                                   )

# Get the predictions and probabilities, and add them to non2016FourLocalesData
maxCol <- apply(non2016FourLocalesMatrix, 1, FUN=which.max)
non2016FourLocalesData <- non2016FourLocalesData %>%
    mutate(predicted=xgbFourLocales$yTrainLevels[maxCol], 
           probPredicted=apply(non2016FourLocalesMatrix, 1, FUN=max), 
           correct=locNamefct==predicted
           )
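The byrow=TRUE in the reshape above reflects how multi:softprob returns its output: a flat vector with the num_class probabilities for each observation stored contiguously. A toy check of that reshape-then-which.max logic (the probability values are made up):

```r
# Two toy observations, four classes each, row-major as multi:softprob returns
numClass <- 4
predVec  <- c(0.70, 0.10, 0.10, 0.10,   # row 1: class 1 most likely
              0.05, 0.05, 0.80, 0.10)   # row 2: class 3 most likely

# ncol = number of classes, byrow = TRUE recovers one row per observation
predMat <- matrix(predVec, ncol = numClass, byrow = TRUE)
maxCol  <- apply(predMat, 1, which.max)
```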

# Helper to assess prediction accuracy by a grouping variable
helperNon2016Accuracy <- function(df, grpVar, grpLabel) {
    p1 <- df %>%
        group_by_at(grpVar) %>%
        summarize(pctCorrect=mean(correct)) %>%
        ggplot(aes(x=factor(get(grpVar)), y=pctCorrect)) + 
        geom_col(fill="lightblue") + 
        geom_text(aes(y=pctCorrect/2, label=paste0(round(100*pctCorrect), "%"))) +
        coord_flip() + 
        ylim(c(0, 1)) + 
        labs(x=grpLabel, y="Percent Correct", title="Accuracy of predictions to non-2016 data")
    print(p1)
}

helperNon2016Accuracy(non2016FourLocalesData, grpVar="year", grpLabel="Year")

helperNon2016Accuracy(non2016FourLocalesData, grpVar="locNamefct", grpLabel="Actual Locale")

helperNon2016Accuracy(non2016FourLocalesData, grpVar="month", grpLabel="Month")

helperNon2016Accuracy(non2016FourLocalesData, grpVar="hrfct", grpLabel="Hour (Zulu Time)")

# Confusion matrix
non2016FourLocalesData %>%
    count(locNamefct, predicted) %>%
    group_by(locNamefct) %>%
    mutate(pct=n/sum(n)) %>%
    ggplot(aes(x=predicted, y=locNamefct)) + 
    geom_tile(aes(fill=pct)) + 
    geom_text(aes(label=paste0(round(100*pct), "%"))) + 
    scale_fill_continuous("", low="white", high="green") + 
    labs(title="Predicted vs. Actual Locale Frequency", y="Actual Locale", x="Predicted Locale")

Prediction accuracy dips from ~98% on the 2016 test data to ~92% on the 2014-2015 and 2017-2019 data from the same locales. This suggests that while the model is learning significant features that differentiate these locales, it is also learning features specific to 2016.

Chicago and Las Vegas are better classified in years other than 2016, retaining ~95% accuracy. Predictions are generally better in the summer months than in the winter months, and there is no meaningful difference in accuracy by hour.

Next steps are to train the model on a larger tranche of data (2015-2018) to see if this can be better generalized to out-of-sample years (2014, 2019) for the same locales:

fourLocales <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

xgbFourLocalesCV_20152018 <- xgbRunModel(filter(metarData, !is.na(TempF)), 
                                         depVar="locNamefct", 
                                         predVars=locXGBPreds, 
                                         otherVars=keepVarFull, 
                                         critFilter=list(year=2015:2018, locNamefct=fourLocales), 
                                         seed=2008091301,
                                         nrounds=1000,
                                         print_every_n=50,
                                         xgbObjective="multi:softprob", 
                                         funcRun=xgboost::xgb.cv,
                                         nfold=5,
                                         calcErr=FALSE,
                                         num_class=length(fourLocales)
                                         )
## [1]  train-merror:0.213119+0.002119  test-merror:0.217464+0.003057 
## [51] train-merror:0.052677+0.000904  test-merror:0.069775+0.001018 
## [101]    train-merror:0.020469+0.000549  test-merror:0.041317+0.002381 
## [151]    train-merror:0.008479+0.000429  test-merror:0.028643+0.001862 
## [201]    train-merror:0.003949+0.000112  test-merror:0.024088+0.001306 
## [251]    train-merror:0.001940+0.000157  test-merror:0.021610+0.001605 
## [301]    train-merror:0.000934+0.000068  test-merror:0.020177+0.001172 
## [351]    train-merror:0.000422+0.000032  test-merror:0.019133+0.001343 
## [401]    train-merror:0.000207+0.000028  test-merror:0.018519+0.001087 
## [451]    train-merror:0.000092+0.000027  test-merror:0.018048+0.001045 
## [501]    train-merror:0.000031+0.000017  test-merror:0.017751+0.000815 
## [551]    train-merror:0.000013+0.000012  test-merror:0.017679+0.000737 
## [601]    train-merror:0.000008+0.000010  test-merror:0.017485+0.000653 
## [651]    train-merror:0.000000+0.000000  test-merror:0.017270+0.000610 
## [701]    train-merror:0.000000+0.000000  test-merror:0.017311+0.000621 
## [751]    train-merror:0.000000+0.000000  test-merror:0.017209+0.000785 
## [801]    train-merror:0.000000+0.000000  test-merror:0.017127+0.000729 
## [851]    train-merror:0.000000+0.000000  test-merror:0.017086+0.000828 
## [901]    train-merror:0.000000+0.000000  test-merror:0.017055+0.000811 
## [951]    train-merror:0.000000+0.000000  test-merror:0.017086+0.000821 
## [1000]   train-merror:0.000000+0.000000  test-merror:0.016911+0.000791

Results of the cross-validation can be assessed:

minTestError_20152018 <- xgbFourLocalesCV_20152018$xgbModel$evaluation_log %>%
    pull(test_merror_mean) %>%
    min()

xgbFourLocalesCV_20152018$xgbModel$evaluation_log %>%
    filter(test_merror_mean <= 0.08) %>%
    ggplot(aes(x=iter, y=test_merror_mean)) +
    geom_line() +
    labs(x="# Iterations", 
         y="Test Error", 
         title="Evolution of Test Error", 
         subtitle="Filtered to Test Error <= 0.08"
         ) + 
    ylim(c(0, NA)) + 
    geom_vline(aes(xintercept=0), lty=2) + 
    geom_hline(aes(yintercept=minTestError_20152018), color="red", lty=2)

The model using 2015-2018 data appears to drive slightly lower test error than the 2016-only modeling. Test error evolution appears to be stable by around 1000 iterations. The model is run for 1000 rounds:

fourLocales <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

xgbFourLocales_20152018 <- xgbRunModel(filter(metarData, !is.na(TempF)), 
                                       depVar="locNamefct", 
                                       predVars=locXGBPreds, 
                                       otherVars=keepVarFull, 
                                       critFilter=list(year=2015:2018, locNamefct=fourLocales), 
                                       seed=2008081315,
                                       nrounds=1000,
                                       print_every_n=50,
                                       xgbObjective="multi:softprob", 
                                       funcRun=xgboost::xgboost,
                                       calcErr=FALSE,
                                       num_class=length(fourLocales)
                                       )
## [1]  train-merror:0.210493 
## [51] train-merror:0.051513 
## [101]    train-merror:0.020843 
## [151]    train-merror:0.008927 
## [201]    train-merror:0.004842 
## [251]    train-merror:0.002518 
## [301]    train-merror:0.001351 
## [351]    train-merror:0.000758 
## [401]    train-merror:0.000369 
## [451]    train-merror:0.000174 
## [501]    train-merror:0.000082 
## [551]    train-merror:0.000051 
## [601]    train-merror:0.000020 
## [651]    train-merror:0.000000 
## [701]    train-merror:0.000000 
## [751]    train-merror:0.000000 
## [801]    train-merror:0.000000 
## [851]    train-merror:0.000000 
## [901]    train-merror:0.000000 
## [951]    train-merror:0.000000 
## [1000]   train-merror:0.000000
## Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
## This warning is displayed once per session.

Model performance can be evaluated:

# Overall success
xgbFourLocales_20152018$testData %>%
    summarize(mean(locNamefct==predicted))
## # A tibble: 1 x 1
##   `mean(locNamefct == predicted)`
##                             <dbl>
## 1                           0.984
# Histogram of predicted probabilities for the predicted class
xgbFourLocales_20152018$testData %>%
    mutate(correct=locNamefct==predicted) %>%
    ggplot() + 
    stat_count(aes(x=round(probPredicted, 2), y=..prop.., group=correct)) + 
    facet_wrap(~correct) + 
    labs(x="Probability Given to Prediction", 
         y="Proportion of Predictions", 
         title="Probability of Prediction vs. Accuracy of Prediction"
         )

# Confusion matrix
xgbFourLocales_20152018$testData %>%
    count(locNamefct, predicted) %>%
    group_by(locNamefct) %>%
    mutate(pct=n/sum(n)) %>%
    ggplot(aes(x=predicted, y=locNamefct)) + 
    geom_tile(aes(fill=pct)) + 
    geom_text(aes(label=paste0(round(100*pct), "%"))) + 
    scale_fill_continuous("", low="white", high="green") + 
    labs(title="Predicted vs. Actual Locale Frequency", y="Actual Locale", x="Predicted Locale")

# Find and plot importances
xgb_fourLocales_importance_20152018 <- plotXGBImportance(xgbFourLocales_20152018, 
                                                         featureStems=locXGBPreds, 
                                                         stemMapper = varMapper, 
                                                         plotTitle="Gain by variable in xgboost", 
                                                         plotSubtitle="2015-2018 Locale (LAS, MSY, ORD, SAN)"
                                                         )
## # A tibble: 10 x 4
##    Feature         Gain  Cover Frequency
##    <chr>          <dbl>  <dbl>     <dbl>
##  1 month         0.114  0.152     0.0903
##  2 hrfct         0.0121 0.0894    0.0658
##  3 TempF         0.205  0.127     0.137 
##  4 DewF          0.296  0.121     0.137 
##  5 modSLP        0.128  0.182     0.190 
##  6 Altimeter     0.0825 0.149     0.146 
##  7 WindSpeed     0.0488 0.0465    0.0958
##  8 predomDir     0.0527 0.0644    0.0658
##  9 minHeight     0.0497 0.0494    0.0425
## 10 ceilingHeight 0.0112 0.0191    0.0295

Accuracy and variable importances seem comparable to the previous model run using only 2016 data. Performance by dimension and the confusion matrix can be evaluated:

# Extract data for the four locales, excluding 2015-2018
non20152018FourLocalesData <- metarData %>%
    filter(!is.na(TempF), !(year %in% 2015:2018), locNamefct %in% fourLocales)

# Make the predictions
non20152018FourLocalesPred <- non20152018FourLocalesData %>%
    mutate_if(is.factor, .funs=fct_drop) %>%
    helperMakeSparse(predVars=locXGBPreds) %>%
    predict(xgbFourLocales_20152018$xgbModel, newdata=.)

# Create the prediction matrix
non20152018FourLocalesMatrix <- matrix(data=non20152018FourLocalesPred, 
                                       ncol=length(fourLocales), 
                                       nrow=nrow(non20152018FourLocalesData), 
                                       byrow=TRUE
                                       )

# Get the predictions and probabilities, and add them to non20152018FourLocalesData
maxCol <- apply(non20152018FourLocalesMatrix, 1, FUN=which.max)
non20152018FourLocalesData <- non20152018FourLocalesData %>%
    mutate(predicted=xgbFourLocales_20152018$yTrainLevels[maxCol], 
           probPredicted=apply(non20152018FourLocalesMatrix, 1, FUN=max), 
           correct=locNamefct==predicted
           )

# Redefine the accuracy helper with a title reflecting the 2015-2018 training window
helperNon2016Accuracy <- function(df, grpVar, grpLabel) {
    p1 <- df %>%
        group_by_at(grpVar) %>%
        summarize(pctCorrect=mean(correct)) %>%
        ggplot(aes(x=factor(get(grpVar)), y=pctCorrect)) + 
        geom_col(fill="lightblue") + 
        geom_text(aes(y=pctCorrect/2, label=paste0(round(100*pctCorrect), "%"))) +
        coord_flip() + 
        ylim(c(0, 1)) + 
        labs(x=grpLabel, y="Percent Correct", title="Accuracy of predictions to out-of-sample years")
    print(p1)
}

helperNon2016Accuracy(non20152018FourLocalesData, grpVar="year", grpLabel="Year")

helperNon2016Accuracy(non20152018FourLocalesData, grpVar="locNamefct", grpLabel="Actual Locale")

helperNon2016Accuracy(non20152018FourLocalesData, grpVar="month", grpLabel="Month")

helperNon2016Accuracy(non20152018FourLocalesData, grpVar="hrfct", grpLabel="Hour (Zulu Time)")

# Confusion matrix
non20152018FourLocalesData %>%
    count(locNamefct, predicted) %>%
    group_by(locNamefct) %>%
    mutate(pct=n/sum(n)) %>%
    ggplot(aes(x=predicted, y=locNamefct)) + 
    geom_tile(aes(fill=pct)) + 
    geom_text(aes(label=paste0(round(100*pct), "%"))) + 
    scale_fill_continuous("", low="white", high="green") + 
    labs(title="Predicted vs. Actual Locale Frequency", y="Actual Locale", x="Predicted Locale")

Prediction quality for out-of-sample years increases meaningfully, from ~90% when the model is based only on 2016 data to ~95% when the model is based on 2015-2018 data. This suggests that modeling on multiple years helps train the model on recurring climate differences as opposed to sporadic local weather anomalies in a given year. Chicago appears to almost always be predicted as itself, though each of the other locales predicts as Chicago on occasion.

Data for all other locales is then run through the model to see which of the four cities each of the other cities is “most like”:

# Extract 2016 data for the locales other than the four modeled
otherLocalesData <- metarData %>%
    filter(!is.na(TempF), year == 2016, !(locNamefct %in% fourLocales))

# Make the predictions
otherLocalesPred <- otherLocalesData %>%
    mutate_if(is.factor, .funs=fct_drop) %>%
    helperMakeSparse(predVars=locXGBPreds) %>%
    predict(xgbFourLocales_20152018$xgbModel, newdata=.)

# Create the prediction matrix
otherLocalesMatrix <- matrix(data=otherLocalesPred, 
                             ncol=length(fourLocales), 
                             nrow=nrow(otherLocalesData), 
                             byrow=TRUE
                             )

# Get the predictions and probabilities, and add them to otherLocalesData
maxCol <- apply(otherLocalesMatrix, 1, FUN=which.max)
otherLocalesData <- otherLocalesData %>%
    mutate(predicted=xgbFourLocales_20152018$yTrainLevels[maxCol], 
           probPredicted=apply(otherLocalesMatrix, 1, FUN=max), 
           correct=locNamefct==predicted
           )

# Confusion matrix
otherLocalesData %>%
    count(locNamefct, predicted) %>%
    group_by(locNamefct) %>%
    mutate(pct=n/sum(n), pctChi=ifelse(predicted=="Chicago, IL", pct, 0)) %>%
    ggplot(aes(x=predicted, y=fct_reorder(locNamefct, pctChi, .fun=max))) + 
    geom_tile(aes(fill=pct)) + 
    geom_text(aes(label=paste0(round(100*pct), "%"))) + 
    scale_fill_continuous("", low="white", high="green") + 
    labs(title="Predicted vs. Actual Locale Frequency", y="Actual Locale", x="Predicted Locale")

Findings include:

  • Cold weather cities tend to be 90%+ assigned to Chicago
  • Phoenix is highly associated with Las Vegas, while, somewhat surprisingly, Denver and San Antonio also align primarily with Las Vegas
  • Houston, Tampa, and Miami align primarily with New Orleans, with some association also to San Diego
  • Coastal California cities tend to align to San Diego and Chicago, with the proportion of Chicago-like days higher in northern California than in Los Angeles

Next steps are to consolidate the functions for data preparation, modeling (including CV), and prediction for XGB such that the process is better aligned with functional programming.
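One possible shape for that consolidation, sketched with purrr. The helper names here (prepareData, fitXGB, predictXGB) are hypothetical placeholder stubs standing in for the real data preparation, modeling, and prediction functions, so the pipeline shape, not the modeling, is what this illustrates:

```r
library(purrr)

# Hypothetical placeholders: each just tags its input with the stage it ran
prepareData <- function(x, ...) c(x, "prepared")   # placeholder
fitXGB      <- function(x, ...) c(x, "fitted")     # placeholder
predictXGB  <- function(x, ...) c(x, "predicted")  # placeholder

# compose() applies right-to-left: prepare, then fit, then predict
runLocaleModel <- compose(predictXGB, fitXGB, prepareData)

# The same pipeline can then be mapped over several locale sets
results <- map(list("Chicago, IL", "Las Vegas, NV"), runLocaleModel)
```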